Abstract

Suppose that we observe $y \in \mathbb{R}^f$ and $X \in \mathbb{R}^{f \times m}$ in the following errors-in-variables model:

$$y = X_0 \beta^* + \epsilon,$$
$$X = X_0 + W,$$

where $X_0$ is an $f \times m$ design matrix with independent subgaussian row vectors, $\epsilon \in \mathbb{R}^f$ is a noise vector and $W$ is a mean zero $f \times m$ random noise matrix with independent subgaussian column vectors, independent of $X_0$ and $\epsilon$. This model is significantly different from those analyzed in the literature in the sense that we allow the measurement error for each covariate to be a dependent vector across its $f$ observations. Such error structures appear in the science literature when modeling the trial-to-trial fluctuations in response strength shared across a set of neurons.

Under sparsity and restricted eigenvalue type conditions, we show that one is able to recover a sparse vector $\beta^* \in \mathbb{R}^m$ from the model given a single observation matrix $X$ and the response vector $y$. We establish consistency in estimating $\beta^*$ and obtain the rates of convergence in the $\ell_q$ norm, where $q = 1, 2$ for the Lasso-type estimator, and $q \in [1, 2]$ for a Dantzig-type conic programming estimator. We show error bounds which approach those of the regular Lasso and the Dantzig selector when the errors in $W$ tend to 0.


1 Introduction 


The matrix variate normal model has a long history in psychology and social sciences, and is becoming increasingly popular in biology and genomics, neuroscience, econometric theory, image and signal processing, wireless communication, and machine learning in recent years; see for example Dawid (1981); Gupta and Varga (1992); Dutilleul (1999); Werner et al. (2008); Bonilla et al. (2008); Yu et al. (2009); Efron (2009); Allen and Tibshirani (2010); Kalaitzis et al. (2013), and the references therein. We call the random matrix $X$, which contains $f$ rows and $m$ columns, a single data matrix, or one instance from the matrix variate normal distribution. We say that an $f \times m$ random matrix $X$ follows a matrix normal distribution with a separable covariance matrix $\Sigma_X = A \otimes B$, which we write $X_{f \times m} \sim \mathcal{N}_{f,m}(M, A_{m \times m} \otimes B_{f \times f})$. This is equivalent to saying that $\mathrm{vec}\{X\}$ follows a multivariate normal distribution with mean $\mathrm{vec}\{M\}$ and covariance $\Sigma_X = A \otimes B$. Here, $\mathrm{vec}\{X\}$ is formed by stacking the columns of $X$ into a vector in $\mathbb{R}^{mf}$. Intuitively, $A$ describes the covariance between columns of $X$ while $B$ describes the covariance between rows of $X$. See Dawid (1981); Gupta and Varga (1992) for more characterization and examples.
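The equivalence between the matrix normal model and the Kronecker product covariance of $\mathrm{vec}\{X\}$ can be checked numerically. The following is a minimal sketch, assuming mean $M = 0$ and the construction $X = B^{1/2} Z A^{1/2}$ with a standard Gaussian $Z$; the helper names `sym_sqrt` and `vec` and the particular matrices are illustrative choices, not taken from the paper.

```python
import numpy as np

def sym_sqrt(M):
    """Symmetric square root of a symmetric positive semi-definite matrix."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def vec(M):
    """Stack the columns of M into one long vector (column-major order)."""
    return M.reshape(-1, order="F")

f, m = 3, 4
rng = np.random.default_rng(0)
G = rng.standard_normal((m, m))
A = G @ G.T + m * np.eye(m)      # m x m: covariance between columns
H = rng.standard_normal((f, f))
B = H @ H.T + f * np.eye(f)      # f x f: covariance between rows

A_half, B_half = sym_sqrt(A), sym_sqrt(B)
Z = rng.standard_normal((f, m))  # i.i.d. N(0, 1) entries
X = B_half @ Z @ A_half          # one draw with vec(X) ~ N(0, A kron B)

# vec(P Z Q) = (Q^T kron P) vec(Z), hence vec(X) = T vec(Z).
T = np.kron(A_half, B_half)
cov_vec_X = T @ T.T              # covariance of vec(X), since Cov(vec Z) = I
```

Here `cov_vec_X` coincides with `np.kron(A, B)`, whose $(i, j)$ block of size $f \times f$ is $a_{ij} B$; this is the sense in which $A$ couples columns and $B$ couples rows.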

In this paper, we introduce the related Kronecker Sum models to encode the covariance structure of a matrix variate distribution. The proposed models and methods incorporate ideas from recent advances in graphical models, high-dimensional regression models with observation errors, and matrix decomposition. Let $A_{m \times m}$, $B_{f \times f}$ be symmetric positive definite covariance matrices. Denote the Kronecker sum of $A = (a_{ij})$ and $B = (b_{ij})$ by

$$\Sigma = A \oplus B := A \otimes I_f + I_m \otimes B =
\begin{bmatrix}
a_{11} I_f + B & a_{12} I_f & \cdots & a_{1m} I_f \\
a_{21} I_f & a_{22} I_f + B & \cdots & a_{2m} I_f \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} I_f & a_{m2} I_f & \cdots & a_{mm} I_f + B
\end{bmatrix}_{(mf) \times (mf)}$$

where $I_f$ is an $f \times f$ identity matrix. This covariance model arises naturally from the context of the errors-in-variables regression model defined as follows. Suppose that we observe $y \in \mathbb{R}^f$ and $X \in \mathbb{R}^{f \times m}$ in the following model:


$$y = X_0 \beta^* + \epsilon \tag{1a}$$

$$X = X_0 + W \tag{1b}$$

where $X_0$ is an $f \times m$ design matrix with independent row vectors, $\epsilon \in \mathbb{R}^f$ is a noise vector and $W$ is a mean zero $f \times m$ random noise matrix, independent of $X_0$ and $\epsilon$, with independent column vectors. In particular, we are interested in the additive model of $X = X_0 + W$ such that

$$\mathrm{vec}\{X\} \sim \mathcal{N}(0, \Sigma) \quad \text{where} \quad \Sigma = A \oplus B := A \otimes I_f + I_m \otimes B \tag{2}$$

where we use one covariance component $A \otimes I_f$ to describe the covariance of the matrix $X_0 \in \mathbb{R}^{f \times m}$, which is considered as the signal matrix, and the other component $I_m \otimes B$ to describe that of the noise matrix $W \in \mathbb{R}^{f \times m}$, where $\mathbb{E}\,\omega^j \otimes \omega^j = B$ for all $j$, where $\omega^j$ denotes the $j$-th column vector of $W$. Our focus is on deriving the statistical properties of two estimators for estimating $\beta^*$ in (1a) and (1b) despite the presence of the additive error $W$ in the observation matrix $X$. We will show that our theory and analysis work with a model much more general than that in (2), which we will define in Section 1.1.
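The Kronecker sum in (2) can be assembled directly from two Kronecker products. The sketch below uses small, arbitrarily chosen matrices `A` and `B` (assumptions for illustration only); `kronecker_sum` and `block` are hypothetical helper names.

```python
import numpy as np

def kronecker_sum(A, B):
    """Return A (+) B = A kron I_f + I_m kron B for A (m x m) and B (f x f)."""
    m, f = A.shape[0], B.shape[0]
    return np.kron(A, np.eye(f)) + np.kron(np.eye(m), B)

def block(S, i, j, f):
    """Extract the (i, j) block of size f x f from an (m f) x (m f) matrix."""
    return S[i * f:(i + 1) * f, j * f:(j + 1) * f]

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])                 # m = 2
B = np.array([[1.0, 0.2, 0.0],
              [0.2, 1.0, 0.2],
              [0.0, 0.2, 1.0]])            # f = 3
m, f = A.shape[0], B.shape[0]
Sigma = kronecker_sum(A, B)                # shape (m f, m f) = (6, 6)

# Diagonal blocks are a_ii I_f + B, off-diagonal blocks are a_ij I_f.
diag_block = block(Sigma, 0, 0, f)
off_block = block(Sigma, 0, 1, f)

# The eigenvalues of A (+) B are all pairwise sums of eigenvalues of A and B.
eig_sum = np.sort(np.add.outer(np.linalg.eigvalsh(A), np.linalg.eigvalsh(B)).ravel())
```

The eigenvalue identity in the last line is a standard property of Kronecker sums; in particular $A \oplus B$ is positive definite whenever $A$ and $B$ are.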

Before we go on to define our estimators, we now use an example to motivate (2) and its subgaussian generalization in Definition 1.2. Suppose that there are $f$ patients in a particular study, for which we use $X_0$ to model the "systolic blood pressure" and $W$ to model the seasonal effects. In this case, $X$ models the fact that among the $f$ patients we measure, each patient has its own row vector of observed blood pressures across time, and each column vector in $W$ models the seasonal variation on top of the true signal at a particular day/time. Thus we consider $X$ as a measurement of $X_0$ with $W$ being the observation error. That is, we model the seasonal effects on blood pressures across a set of patients in a particular study with a vector of dependent entries. Thus $W$ is a matrix which consists of repeated independent sampling of spatially dependent vectors, if we regard the individuals as having spatial coordinates, for example, through their geographic locations. We will come back to discuss this example in Section 1.3.

1.1 The model and the method 

We first need to define an independent isotropic vector with subgaussian marginals as in Definition 1.1.

Definition 1.1. Let $Y$ be a random vector in $\mathbb{R}^p$.

1. $Y$ is called isotropic if for every $y \in \mathbb{R}^p$, $\mathbb{E}\langle Y, y\rangle^2 = \|y\|_2^2$.

2. $Y$ is $\psi_2$ with a constant $\alpha$ if for every $y \in \mathbb{R}^p$,

$$\|\langle Y, y\rangle\|_{\psi_2} := \inf\{t : \mathbb{E}\exp(\langle Y, y\rangle^2/t^2) \le 2\} \le \alpha\|y\|_2. \tag{3}$$

The $\psi_2$ condition on a scalar random variable $V$ is equivalent to the subgaussian tail decay of $V$, which means $\mathbb{P}(|V| > t) \le 2\exp(-t^2/c^2)$ for all $t > 0$.

Throughout this paper, we use $\psi_2$ vector, a vector with subgaussian marginals and subgaussian vector interchangeably.
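As a worked instance of the $\psi_2$ condition in (3), consider a standard normal scalar $g \sim N(0, 1)$; this example is added for illustration and is not part of the original text. Using the Gaussian moment generating function of $g^2$, the infimum in (3) can be computed in closed form:

```latex
\mathbb{E}\exp\!\left(g^2/t^2\right) = \left(1 - \frac{2}{t^2}\right)^{-1/2}
\quad (t^2 > 2),
\qquad
\left(1 - \frac{2}{t^2}\right)^{-1/2} \le 2
\iff t^2 \ge \frac{8}{3},
\qquad\text{hence}\qquad
\|g\|_{\psi_2} = \sqrt{8/3}.
```

Consequently, for $Y \sim N(0, I_p)$ and any fixed $y$, the inner product $\langle Y, y\rangle$ is $N(0, \|y\|_2^2)$, so $Y$ is isotropic and is $\psi_2$ with constant $\alpha = \sqrt{8/3}$.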

Definition 1.2. Let $Z$ be an $f \times m$ random matrix with independent entries $Z_{ij}$ satisfying $\mathbb{E} Z_{ij} = 0$, $1 = \mathbb{E} Z_{ij}^2 \le \|Z_{ij}\|_{\psi_2} \le K$. Let $Z_1, Z_2$ be independent copies of $Z$. Let $X = X_0 + W$ such that

1. $X_0 = Z_1 A^{1/2}$ is the design matrix with independent subgaussian row vectors, and

2. $W = B^{1/2} Z_2$ is a random noise matrix with independent subgaussian column vectors.

Assumption (A1) allows the covariance model in (2) and its subgaussian variant in Definition 1.2 to be identifiable.

(A1) We assume $\mathrm{tr}(A) = m$ is a known parameter, where $\mathrm{tr}(A)$ denotes the trace of matrix $A$.

In the Kronecker sum model, we could instead assume that $\mathrm{tr}(B)$ is known, in order not to assume knowledge of $\mathrm{tr}(A)$. Assuming one or the other is known is unavoidable as the covariance model is not identifiable otherwise. Moreover, by knowing $\mathrm{tr}(A)$, we can construct an estimator for $\mathrm{tr}(B)$:

$$\widehat{\mathrm{tr}}(B) = \frac{1}{m}\left(\|X\|_F^2 - f\,\mathrm{tr}(A)\right)_+ \quad \text{and define} \quad \hat\tau_B := \frac{1}{f}\widehat{\mathrm{tr}}(B) \ge 0 \tag{4}$$

where $(a)_+ = a \vee 0$. We first introduce the Lasso-type estimator, adapted from those considered in Loh and Wainwright (2012).
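The trace estimator in (4) rests on the identity $\mathbb{E}\|X\|_F^2 = f\,\mathrm{tr}(A) + m\,\mathrm{tr}(B)$. The following simulation is a minimal sketch under assumed settings: Gaussian entries for $Z_1, Z_2$, and AR(1)-type choices of $A$ and $B$ picked only for illustration; the function names are hypothetical.

```python
import numpy as np

def sym_sqrt(M):
    """Symmetric square root of a symmetric positive semi-definite matrix."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def estimate_trace_B(X, trace_A):
    """Plug-in estimator (4): returns (tr_hat(B), tau_hat_B = tr_hat(B) / f)."""
    f, m = X.shape
    tr_B_hat = max((np.linalg.norm(X, "fro") ** 2 - f * trace_A) / m, 0.0)
    return tr_B_hat, tr_B_hat / f

rng = np.random.default_rng(1)
f, m = 200, 300

# Assumed covariances: unit diagonal for A so that tr(A) = m as in (A1),
# and diagonal 0.5 for B so that tau_B = tr(B) / f = 0.5.
idx_m, idx_f = np.arange(m), np.arange(f)
A = 0.3 ** np.abs(np.subtract.outer(idx_m, idx_m))
B = 0.5 * 0.4 ** np.abs(np.subtract.outer(idx_f, idx_f))

Z1 = rng.standard_normal((f, m))
Z2 = rng.standard_normal((f, m))
X = Z1 @ sym_sqrt(A) + sym_sqrt(B) @ Z2   # X = X_0 + W as in Definition 1.2

tr_B_hat, tau_B_hat = estimate_trace_B(X, trace_A=float(m))
```

Only the single observed matrix `X` and the known value of $\mathrm{tr}(A)$ enter the estimator; no replicate of $W$ is used.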

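Suppose that $\hat\tau_B = \widehat{\mathrm{tr}}(B)/f$ is an estimator for $\tau_B = \mathrm{tr}(B)/f$; for example, as constructed in (4). Let

$$\hat\Gamma = \frac{1}{f}X^T X - \frac{1}{f}\widehat{\mathrm{tr}}(B) I_m \quad \text{and} \quad \hat\gamma = \frac{1}{f}X^T y. \tag{5}$$

For a chosen penalization parameter $\lambda \ge 0$, and parameters $b_0$ and $d$, we consider the following regularized estimation with the $\ell_1$-norm penalty,

$$\hat\beta = \arg\min_{\beta : \|\beta\|_1 \le b_0\sqrt{d}} \frac{1}{2}\beta^T\hat\Gamma\beta - \langle\hat\gamma, \beta\rangle + \lambda\|\beta\|_1, \tag{6}$$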
which is a variation of the Lasso Tibshirani (1996) or the Basis Pursuit Chen et al. (1998) estimator. Although in our analysis we set $b_0 \ge \|\beta^*\|_2$ and $d = |\mathrm{supp}(\beta^*)|$ for simplicity, in practice both $b_0$ and $d$ are understood to be parameters chosen to provide an upper bound on the $\ell_2$ norm and the sparsity of the true $\beta^*$.
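One way to compute a solution of the corrected Lasso-type program numerically is a composite (proximal) gradient iteration: a gradient step on the quadratic part, soft-thresholding for the $\ell_1$ penalty, and a projection onto an $\ell_1$ ball of radius $b_0\sqrt{d}$. The paper does not prescribe a solver at this point, so the sketch below is an assumption in the spirit of the composite gradient methods used for such estimators; the step size rule, iteration count, simulation settings and all function names are illustrative.

```python
import numpy as np

def sym_sqrt(M):
    """Symmetric square root of a symmetric positive semi-definite matrix."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def project_l1_ball(v, radius):
    """Euclidean projection of v onto {b : ||b||_1 <= radius} (sort-based)."""
    if np.abs(v).sum() <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    k = np.arange(1, u.size + 1)
    rho = np.nonzero(u * k > css - radius)[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def corrected_lasso(X, y, tau_B_hat, lam, radius, n_iter=500):
    """Heuristic composite gradient iteration for the corrected Lasso program."""
    f, m = X.shape
    Gamma = X.T @ X / f - tau_B_hat * np.eye(m)   # corrected Gram matrix
    gamma = X.T @ y / f
    step = 1.0 / max(np.linalg.eigvalsh(Gamma).max(), 1e-12)
    beta = np.zeros(m)
    for _ in range(n_iter):
        z = beta - step * (Gamma @ beta - gamma)                    # gradient step
        z = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)    # soft threshold
        beta = project_l1_ball(z, radius)                           # side constraint
    return beta

# Simulated data: A = I_m (so tr(A) = m), B with dependent rows and tau_B = 0.25.
rng = np.random.default_rng(2)
f, m, d = 1000, 60, 3
idx_f = np.arange(f)
B = 0.25 * 0.5 ** np.abs(np.subtract.outer(idx_f, idx_f))
beta_star = np.zeros(m)
beta_star[:d] = [1.5, -1.0, 1.0]

X0 = rng.standard_normal((f, m))
W = sym_sqrt(B) @ rng.standard_normal((f, m))
X = X0 + W
y = X0 @ beta_star + 0.1 * rng.standard_normal(f)

tau_B_hat = max((np.linalg.norm(X, "fro") ** 2 - f * m) / m, 0.0) / f
b0 = 2.0 * np.linalg.norm(beta_star)              # assumed upper bound on ||beta*||_2
beta_hat = corrected_lasso(X, y, tau_B_hat, lam=0.25, radius=b0 * np.sqrt(d))
```

When $f < m$ the matrix $\hat\Gamma$ has negative eigenvalues and the program is non-convex, so the iteration above should then be read only as a local search inside the $\ell_1$ ball.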

Recently, Belloni et al. (2014) discussed the following conic programming compensated matrix uncertainty (MU) selector, which is a variant of the Dantzig selector Candes and Tao (2007); Rosenbaum and Tsybakov (2010, 2013). Adapted to our setting, it is defined as follows. Let $\lambda, \mu, \tau > 0$,

$$\hat\beta = \arg\min\left\{\|\beta\|_1 + \lambda t : (\beta, t) \in \Upsilon\right\} \quad \text{where} \quad \Upsilon = \left\{(\beta, t) : \beta \in \mathbb{R}^m,\ \|\hat\gamma - \hat\Gamma\beta\|_\infty \le \mu t + \tau,\ \|\beta\|_2 \le t\right\} \tag{7}$$

where $\hat\gamma$ and $\hat\Gamma$ are as defined in (5) with $\mu \asymp \sqrt{\frac{\log m}{f}}$, $\tau \asymp \sqrt{\frac{\log m}{f}}$. We refer to this estimator as the Conic Programming estimator from now on.
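Solving the conic program requires a second-order cone solver, which is outside the scope of this sketch. The two small helpers below only evaluate the objective and check membership of a candidate pair $(\beta, t)$ in the constraint set: the sup-norm residual must be at most $\mu t + \tau$ and $\|\beta\|_2 \le t$. The function names, the tolerance `tol` and the toy inputs are assumptions made for illustration.

```python
import numpy as np

def conic_objective(beta, t, lam):
    """Objective ||beta||_1 + lam * t of the conic program."""
    return np.abs(beta).sum() + lam * t

def in_conic_set(beta, t, Gamma_hat, gamma_hat, mu, tau, tol=1e-12):
    """Check ||gamma_hat - Gamma_hat beta||_inf <= mu t + tau and ||beta||_2 <= t."""
    residual = np.max(np.abs(gamma_hat - Gamma_hat @ beta))
    return bool(residual <= mu * t + tau + tol and np.linalg.norm(beta) <= t + tol)

# Toy inputs: Gamma_hat = I_3 and gamma_hat = e_1, so beta = e_1 fits exactly.
Gamma_hat = np.eye(3)
gamma_hat = np.array([1.0, 0.0, 0.0])
mu, tau, lam = 0.1, 0.1, 1.0

exact = np.array([1.0, 0.0, 0.0])
zero = np.zeros(3)

feasible_exact = in_conic_set(exact, 1.0, Gamma_hat, gamma_hat, mu, tau)
feasible_zero_small_t = in_conic_set(zero, 0.0, Gamma_hat, gamma_hat, mu, tau)
feasible_zero_large_t = in_conic_set(zero, 10.0, Gamma_hat, gamma_hat, mu, tau)
```

In this toy case the zero vector becomes feasible only by inflating $t$, which the term $\lambda t$ in the objective penalizes: `conic_objective(zero, 10.0, lam)` exceeds `conic_objective(exact, 1.0, lam)`.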


1.2 Our contributions 

We provide a unified analysis of the rates of convergence for both the Lasso-type estimator (6) and the Conic Programming estimator (7), which is a Dantzig selector-type estimator, although under slightly different conditions. We will show the rates of convergence in the $\ell_q$ norm for $q = 1, 2$ for estimating a sparse vector $\beta^* \in \mathbb{R}^m$ in the model (1a) and (1b) using the Lasso-type estimator (6) in Theorems 2 and 4, and the Conic Programming estimator (7) in Theorems 3 and 5 for $1 \le q \le 2$. For the Conic Programming estimator, we also show bounds on the predictive errors. The bounds we derive in both Theorems 2 and 3 focus on cases where the errors in $W$ are not too small in their magnitudes in the sense that $\tau_B := \mathrm{tr}(B)/f$ is bounded from below. For the extreme case when $\tau_B$ approaches 0, one hopes to recover bounds close to those for the regular Lasso or the Dantzig selector as the effect of the noise in matrix $W$ on the procedure becomes negligible. We show in Theorems 4 and 5 that this is indeed the case. These results are new to the best of our knowledge.

In Theorems 2 to 5, we consider the regression model in (1a) and (1b) with subgaussian random design, where $X_0 = Z_1 A^{1/2}$ is a subgaussian random matrix with independent row vectors, and $W = B^{1/2} Z_2$ is an $f \times m$ random noise matrix with independent column vectors, where $Z_1, Z_2$ are independent subgaussian random matrices with independent entries (cf. Definition 1.2). This model is significantly different from those analyzed in the literature. For example, unlike the present work, the authors in Loh and Wainwright (2012) apply Theorem 8, which states a general result on statistical convergence properties of the estimator (6), to cases where $W$ is composed of independent subgaussian row vectors, when the row vectors of $X_0$ are either independent or follow a Gaussian vector auto-regressive model. See also Rosenbaum and Tsybakov (2010, 2013); Chen and Caramanis (2013); Belloni et al. (2014) for the corresponding results on the compensated MU selectors, a variant of the Orthogonal Matching Pursuit algorithm and the Conic Programming estimator (7).

The second key difference between our framework and the existing work is that we assume that only one observation matrix $X$ with the single measurement error matrix $W$ is available. Assuming (A1) allows us to estimate $\mathbb{E} W^T W$ as required in the estimation procedure (5) directly, given the knowledge that $W$ is composed of independent column vectors. In contrast, existing work needs to assume that the covariance matrix $\Sigma_W := \frac{1}{f}\mathbb{E} W^T W$ of the independent row vectors of $W$ or its functionals are either known a priori, or can be estimated from a dataset independent of $X$, or from replicated $X$ measuring the same $X_0$; see for example Rosenbaum and Tsybakov (2010, 2013); Belloni et al. (2014); Loh and Wainwright (2012); Carroll et al. (2006). Such repeated measurements are not always available or are costly to obtain in practice Carroll et al. (2006).


A noticeable exception is the work of Chen and Caramanis (2013), which deals with the scenario when the noise covariance is not assumed to be known. We now elaborate on their result, which is a variant of the orthogonal matching pursuit (OMP) algorithm Tropp (2004); Tropp and Gilbert (2007). Their support recovery result, that is, recovering the support set of $\beta^*$, applies only to the case when both the signal matrix and the measurement error matrix have isotropic subgaussian row vectors; that is, they assume independence among both rows and columns in $X$ ($X_0$ and $W$); moreover, their algorithm requires the
knowledge of the sparsity parameter $d$, which is the number of non-zero entries in $\beta^*$, as well as a $\beta_{\min}$ condition: $\min_{j \in \mathrm{supp}\,\beta^*} |\beta^*_j| = \Omega\left(\sqrt{\frac{\log m}{f}}\,(\|\beta^*\|_2 + 1)\right)$. They recover essentially the same $\ell_2$-error bounds as in Loh and Wainwright (2012) and the current work when the covariance is known.


In summary, oblivion in $\Sigma_W$ and a general dependency condition in the data matrix $X$ are not simultaneously allowed in existing work. In contrast, while we assume that $X_0$ is composed of independent subgaussian row vectors, we allow rows of $W$ to be dependent, which brings dependency to the row vectors of the observation matrix $X$. In the current paper, we focus on a proof of concept for using the Kronecker sum covariance and additive model to model two-way dependency in the data matrix $X$, and derive bounds in statistical convergence for (6) and (7). In some sense, we are considering a parsimonious model for fitting observation data with two-way dependencies; that is, we use the signal matrix to encode column-wise dependency among covariates in $X$, and the error matrix $W$ to explain its row-wise dependency. When replicates of $X$ or $W$ are available, we are able to study more sophisticated models and inference problems to be described in Section 1.3.


1.3 Discussion 

The key modeling question is: would each row vector in $W$ for a particular patient across all time points be a correlated normal or subgaussian vector as well? It is our conjecture that combining the newly developed techniques, namely, the concentration of measure inequalities we have derived in the current framework, with techniques from existing work, we can handle the case when $W$ follows a matrix normal distribution with a separable covariance matrix $\Sigma_W = C \otimes B$, where $C$ is an $m \times m$ positive semi-definite covariance matrix. Moreover, for this type of "seasonal effects" as the measurement errors, the time varying covariance model would make more sense to model $W$, which we elaborate in the second example.

As a second example, in neuroscience applications, population coding refers to the information contained in the combined activity of multiple neurons Kass et al. (2005). The relationship between population encoding and correlations is complicated and is an area of active investigation; see for example Ruff and Cohen (2014); Cohen and Kohn (2011). It is increasingly common that repeated measurements (trials) simultaneously recorded across a set of neurons and over an ensemble of stimuli are available. In this context, one can imagine using a random matrix $X_0 \sim \mathcal{N}_{f,m}(\mu, A \otimes B)$ which follows a matrix-variate normal distribution, or its subgaussian correspondent, to model the ensemble of mean response variables, e.g., the membrane potential, corresponding to the cross-trial average over a set of experiments. Here we use $A$ to model the task correlations and $B$ to model the baseline correlation structure among all pairs of neurons at the signal level. It has been observed that the onset of stimulus and task events not only change the cross-trial mean response in $\mu$, but also alter the structure and correlation of the noise for a set of neurons, which correspond to the trial-to-trial fluctuations of the neuron responses. We use $W$ to model such task-specific trial-to-trial fluctuations of a set of neurons recorded over the time-course of a variety of tasks. Models as in (1a) and (1b) are useful in predicting the response of a set of neurons based on the current and past mean responses of all neurons. Moreover, we could incorporate non-i.i.d. non-Gaussian $W = [w_1, \ldots, w_m]$ where $w_t = B^{1/2}(t) z(t)$, where $z(1), \ldots, z(m)$ are independent isotropic subgaussian random vectors and $B(t) \succ 0$ for all $t$, to model the time-varying correlated noise as observed in the trial-to-trial fluctuations.

It is possible to combine the techniques developed in the present paper with those in Zhou et al. (2010); Zhou (2014) to develop estimators for $A$, $B$ and the time varying $B(t)$, which is itself an interesting topic, however, beyond the scope of the current work.

We leave the investigation of this more general modeling framework and relevant statistical questions to future work. We refer to Carroll et al. (2006) for an excellent survey of the classical as well as modern developments in measurement error models. In future work, we will also extend the estimation methods to the settings where the covariates are measured with multiplicative errors, which are shown to be reducible to the additive error problem as studied in the present work; see Rosenbaum and Tsybakov (2013); Loh and Wainwright (2012). Moreover, we are interested in applying the analysis and concentration of measure results developed in the current paper and in our ongoing work to the more general contexts and settings where measurement error models are introduced and investigated; see for example Dempster et al. (1977); Carroll et al. (1985); Stefanski (1985); Hwang (1986); Fuller (1987); Stefanski (1990); Carroll and Wand (1991); Carroll et al. (1993); Cook and Stefanski (1994); Stefanski and Cook (1995); Iturria et al. (1999); Liang et al. (1999); Strimmer (2003); Xu and You (2007); Hall and Ma (2007); Liang and Li (2009); Ma and Li (2010); Allen and Tibshirani (2010); Städler et al. (2014); Sørensen et al. (2014b,a) and the references therein.


2 Assumptions and preliminary results 

We will now define some parameters related to the restricted and sparse eigenvalue conditions that are 
needed to state our main results. We also state a preliminary result in Lemma 1 regarding the relationships 
between the two conditions in Definitions 2.1 and 2.2. 

Definition 2.1. (Restricted eigenvalue condition RE($s_0, k_0, A$)). Let $1 \le s_0 \le p$, and let $k_0$ be a positive number. We say that a $p \times q$ matrix $A$ satisfies the RE($s_0, k_0, A$) condition with parameter $K(s_0, k_0, A)$ if for any $\upsilon \ne 0$,

$$\frac{1}{K(s_0, k_0, A)} := \min_{\substack{J \subseteq \{1, \ldots, p\} \\ |J| \le s_0}} \ \min_{\|\upsilon_{J^c}\|_1 \le k_0\|\upsilon_J\|_1} \frac{\|A\upsilon\|_2}{\|\upsilon_J\|_2} > 0. \tag{8}$$

It is clear that when $s_0$ and $k_0$ become smaller, this condition is easier to satisfy. We also consider the following variation of the baseline RE condition.

Definition 2.2. (Lower-RE condition) Loh and Wainwright (2012) The matrix $\Gamma$ satisfies a Lower-RE condition with curvature $\alpha > 0$ and tolerance $\tau > 0$ if

$$\theta^T\Gamma\theta \ge \alpha\|\theta\|_2^2 - \tau\|\theta\|_1^2 \quad \forall \theta \in \mathbb{R}^m.$$

As $\alpha$ becomes smaller, or as $\tau$ becomes larger, the Lower-RE condition is easier to satisfy.
Lemma 1. Suppose that the Lower-RE condition holds for $\Gamma := A^T A$ with $\alpha, \tau > 0$ such that $\tau(1 + k_0)^2 s_0 \le \alpha/2$. Then the RE($s_0, k_0, A$) condition holds for $A$ with

$$\frac{1}{K(s_0, k_0, A)} \ge \sqrt{\frac{\alpha}{2}} > 0.$$

Assume that RE($(k_0 + 1)^2, k_0, A$) holds. Then the Lower-RE condition holds for $\Gamma = A^T A$ with

$$\alpha = \frac{1}{(k_0 + 1)K^2(s_0, k_0, A)} > 0$$

where $s_0 = (k_0 + 1)^2$, and $\tau > 0$ which satisfies

$$\lambda_{\min}(\Gamma) \ge \alpha - \tau s_0/4. \tag{9}$$

The condition above holds for any $\tau \ge \frac{4}{(k_0 + 1)^3 K^2(s_0, k_0, A)} - \frac{4\lambda_{\min}(\Gamma)}{(k_0 + 1)^2}$.

The first part of Lemma 1 means that, if $k_0$ is fixed, then smaller values of $\tau$ guarantee that RE($s_0, k_0, A$) holds with larger $s_0$, that is, a stronger RE condition. The second part of the Lemma implies that a weak RE condition implies that the Lower-RE (LRE) holds with a large $\tau$. On the other hand, if one assumes RE($(k_0 + 1)^2, k_0, A$) holds with a large value of $k_0$ (in other words, a strong RE condition), this would imply LRE with a small $\tau$. In short, the two conditions are similar but require tweaking the parameters. A weaker RE condition implies that the LRE condition holds with a larger $\tau$, and the Lower-RE condition with a smaller $\tau$, that is, a stronger LRE, implies a stronger RE. We prove Lemma 1 in Section 8.
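The mechanism behind the first part of Lemma 1 can be sketched in a few lines; this is an illustrative outline under the stated hypotheses, with $\Gamma = A^T A$, and is not a substitute for the proof in Section 8. Fix $J$ with $|J| \le s_0$ and $\upsilon$ in the cone $\|\upsilon_{J^c}\|_1 \le k_0\|\upsilon_J\|_1$:

```latex
\|\upsilon\|_1 \le (1 + k_0)\|\upsilon_J\|_1 \le (1 + k_0)\sqrt{s_0}\,\|\upsilon_J\|_2,
\\[4pt]
\|A\upsilon\|_2^2 = \upsilon^T \Gamma \upsilon
\ge \alpha\|\upsilon\|_2^2 - \tau\|\upsilon\|_1^2
\ge \alpha\|\upsilon_J\|_2^2 - \tau (1 + k_0)^2 s_0 \|\upsilon_J\|_2^2
\ge \frac{\alpha}{2}\|\upsilon_J\|_2^2 .
```

The last inequality uses $\tau(1 + k_0)^2 s_0 \le \alpha/2$. Taking square roots and minimizing over $J$ and $\upsilon$ gives $1/K(s_0, k_0, A) \ge \sqrt{\alpha/2}$.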

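Definition 2.3. (Upper-RE condition) Loh and Wainwright (2012) The matrix $\Gamma$ satisfies an upper-RE condition with curvature $\bar\alpha > 0$ and tolerance $\tau > 0$ if

$$\theta^T\Gamma\theta \le \bar\alpha\|\theta\|_2^2 + \tau\|\theta\|_1^2 \quad \forall \theta \in \mathbb{R}^m.$$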

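Definition 2.4. Define the largest and smallest $d$-sparse eigenvalue of a $p \times q$ matrix $A$ to be

$$\rho_{\max}(d, A) := \max_{t \ne 0;\ d\text{-sparse}} \frac{\|At\|_2^2}{\|t\|_2^2}, \quad \text{where } d \le p, \tag{10}$$

$$\text{and} \quad \rho_{\min}(d, A) := \min_{t \ne 0;\ d\text{-sparse}} \frac{\|At\|_2^2}{\|t\|_2^2}. \tag{11}$$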
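For small problems, the sparse eigenvalues in (10) and (11) can be computed by brute force: every $d$-sparse vector is supported on some index set of size $d$, so it suffices to scan the extreme eigenvalues of the $d \times d$ principal submatrices of $A^T A$. The sketch below assumes that the sparse vector $t$ ranges over the column dimension of $A$; the function name is hypothetical, and the enumeration is exponential in $d$, so it is meant only as a check on toy examples.

```python
import numpy as np
from itertools import combinations

def sparse_eigenvalues(A, d):
    """Brute-force (rho_max(d, A), rho_min(d, A)) over all supports of size d."""
    n_cols = A.shape[1]
    gram = A.T @ A
    rho_max, rho_min = -np.inf, np.inf
    for support in combinations(range(n_cols), d):
        # Eigenvalues of the principal submatrix, returned in ascending order.
        w = np.linalg.eigvalsh(gram[np.ix_(support, support)])
        rho_max = max(rho_max, w[-1])
        rho_min = min(rho_min, w[0])
    return rho_max, rho_min

A_diag = np.diag([3.0, 2.0, 1.0])
one_sparse = sparse_eigenvalues(A_diag, 1)     # extremes of the squared diagonal
two_sparse = sparse_eigenvalues(A_diag, 2)

rng = np.random.default_rng(3)
A_rand = rng.standard_normal((8, 5))
full = sparse_eigenvalues(A_rand, 5)           # d = number of columns
spectrum = np.linalg.eigvalsh(A_rand.T @ A_rand)
```

With $d$ equal to the number of columns, the sparse eigenvalues reduce to the extreme eigenvalues of $A^T A$, and $\rho_{\max}(d, A)$ is non-decreasing in $d$.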


The rest of the paper is organized as follows. In Section 3, we present two main results, Theorems 2 and 3. We state results which improve upon Theorems 2 and 3 in Section 4, when the measurement errors in $W$ are small in their magnitudes in the sense of $\mathrm{tr}(B)$ being small. In Section 5, we outline the proofs of the main theorems. In particular, we outline the proofs of Theorems 2, 3, 4, and 5 in Sections 5, 5.1, 5.3 and 5.4 respectively. In Section 6, we show a deterministic result as well as its application to the random matrix $\hat\Gamma - A$ for $\hat\Gamma$ as in (5) with regard to the upper and Lower RE conditions. In Section 7, we show the concentration properties of the Gram matrices $XX^T$ and $X^T X$ after we correct them with the corresponding population error terms defined by $\mathrm{tr}(A) I_f$ and $\mathrm{tr}(B) I_m$ respectively. These results might be of independent interest. The technical details of the proofs are collected at the end of the paper. We prove Theorem 2 in Section 9. We prove Theorem 3 in Section 10. We prove Theorems 4 and 5 in Section 11 and Section 12 respectively. The paper concludes with a discussion of the results in Section 13. Additional proofs and theoretical results are collected in the Appendix.
Notation. Let $e_1, \ldots, e_p$ be the canonical basis of $\mathbb{R}^p$. For a set $J \subset \{1, \ldots, p\}$, denote $E_J = \mathrm{span}\{e_j : j \in J\}$. For a matrix $A$, we use $\|A\|_2$ to denote its operator norm. For a set $V \subset \mathbb{R}^p$, we let $\mathrm{conv}\, V$ denote the convex hull of $V$. For a finite set $Y$, the cardinality is denoted by $|Y|$. Let $B_1^p$, $B_2^p$ and $S^{p-1}$ be the unit $\ell_1$ ball, the unit Euclidean ball and the unit sphere respectively. For a matrix $A = (a_{ij})_{1 \le i, j \le m}$, let $\|A\|_{\max} = \max_{i,j}|a_{ij}|$ denote the entry-wise max norm. Let $\|A\|_1 = \max_j \sum_i |a_{ij}|$ denote the matrix $\ell_1$ norm. The Frobenius norm is given by $\|A\|_F^2 = \sum_i\sum_j a_{ij}^2$. Let $|A|$ denote the determinant and $\mathrm{tr}(A)$ be the trace of $A$. Let $\lambda_{\max}(A)$ and $\lambda_{\min}(A)$ be the largest and smallest eigenvalues, and $\kappa(A)$ be the condition number for matrix $A$. The operator or $\ell_2$ norm $\|A\|_2^2$ is given by $\lambda_{\max}(AA^T)$.

For a matrix $A$, denote by $r(A)$ the effective rank $\mathrm{tr}(A)/\|A\|_2$. Let $\|A\|_F^2/\|A\|_2^2$ denote the stable rank for matrix $A$. We write $\mathrm{diag}(A)$ for a diagonal matrix with the same diagonal as $A$. For a symmetric matrix $A$, let $\Upsilon(A) = (\upsilon_{ij})$ where $\upsilon_{ij} = \mathbb{I}(a_{ij} \ne 0)$, where $\mathbb{I}(\cdot)$ is the indicator function. Let $I$ be the identity matrix. We let $C$ be a constant which may change from line to line. For two numbers $a, b$, $a \wedge b := \min(a, b)$ and $a \vee b := \max(a, b)$. We write $a \asymp b$ if $ca \le b \le Ca$ for some positive absolute constants $c, C$ which are independent of $n$, $f$, $m$ or sparsity parameters. Let $(a)_+ := a \vee 0$. We write $a = O(b)$ if $a \le Cb$ for some positive absolute constant $C$ which is independent of $n$, $f$, $m$ or sparsity parameters. These absolute constants $C, C_1, c, c_1, \ldots$ may change line by line.
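The effective rank and the stable rank introduced in the notation are simple functions of the trace, the Frobenius norm and the operator norm. The following sketch shows how they can be computed for a positive semi-definite matrix; the function names and test matrices are illustrative assumptions.

```python
import numpy as np

def effective_rank(A):
    """Effective rank r(A) = tr(A) / ||A||_2 (operator norm in the denominator)."""
    return np.trace(A) / np.linalg.norm(A, 2)

def stable_rank(A):
    """Stable rank ||A||_F^2 / ||A||_2^2."""
    return np.linalg.norm(A, "fro") ** 2 / np.linalg.norm(A, 2) ** 2

identity = np.eye(5)
spiked = np.diag([4.0, 1.0, 1.0])   # one dominant direction

r_identity, s_identity = effective_rank(identity), stable_rank(identity)
r_spiked, s_spiked = effective_rank(spiked), stable_rank(spiked)
```

Both quantities equal the dimension for the identity and shrink towards 1 as a single eigenvalue dominates; for the spiked example the effective rank is $6/4$ and the stable rank is $18/16$.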

3 Main results 

In this section, we will state our main results in Theorems 2 and 3 where we consider the regression model in (1a) and (1b) with random matrices $X_0, W \in \mathbb{R}^{f \times m}$ as defined in Definition 1.2.

For the Lasso-type estimator, we are interested in the case where the smallest eigenvalue of the column-wise covariance matrix $A$ does not approach 0 too quickly and the effective rank of the row-wise covariance matrix $B$ is bounded from below (cf. (14)). For the Conic Programming estimator, we impose a restricted eigenvalue condition as formulated in Bickel et al. (2009); Rudelson and Zhou (2013) on $A$ and assume that the sparsity of $\beta^*$ is bounded by $o(\sqrt{f/\log m})$. These conditions will be relaxed in Section 4 where we allow $\tau_B$ to approach 0.

Before stating our main result for the Lasso-type estimator in Theorem 2, we need to introduce some more notation and assumptions. Let $a_{\max} = \max_i a_{ii}$ and $b_{\max} = \max_j b_{jj}$ be the maximum diagonal entries of $A$ and $B$ respectively. In general, under (A1), one can think of $\lambda_{\min}(A) \le 1$ and for $s \ge 1$,

$$1 \le a_{\max} \le \rho_{\max}(s, A) \le \lambda_{\max}(A),$$

where $\lambda_{\max}(A)$ denotes the maximum eigenvalue of $A$.

(A2) The minimal eigenvalue $\lambda_{\min}(A)$ of the covariance matrix $A$ is bounded: $1 \ge \lambda_{\min}(A) > 0$.

(A3) Moreover, we assume that the condition number $\kappa(A)$ is upper bounded by $O\left(\sqrt{\frac{f}{\log m}}\right)$ and $\tau_B = O(\lambda_{\max}(A))$.


Throughout the rest of the paper, $s_0 \ge 1$ is understood to be the largest integer chosen such that the following inequality still holds:



( 12 ) 








where we denote by $\tau_B = \mathrm{tr}(B)/f$ and $C$ is to be defined. Denote by

$$M_A = \frac{64 C \varpi(s_0)}{\lambda_{\min}(A)} \ge 64C.$$

Throughout this paper, for the Lasso-type estimator, we will use the expression

$$\tau := \frac{\alpha}{s_0}, \quad \text{where } \alpha = \lambda_{\min}(A)/2; \tag{13}$$

(A2) thus ensures that the Lower-RE condition as in Definition 2.2 is not vacuous. (A3) ensures that (12) holds for some $s_0 \ge 1$.

Theorem 2. (Estimation for the Lasso-type estimator) Set $1 \le f \le m$. Suppose $m$ is sufficiently large. Suppose (A1), (A2) and (A3) hold. Consider the regression model in (1a) and (1b) with independent random matrices $X_0, W$ as in Definition 1.2, and an error vector $\epsilon \in \mathbb{R}^f$ independent of $X_0, W$, with independent entries $\epsilon_j$ satisfying $\mathbb{E}\epsilon_j = 0$ and $\|\epsilon_j\|_{\psi_2} \le M_\epsilon$. Let $C_0, c' > 0$ be some absolute constants.

Let $D_2 := 2(\|A\|_2 + \|B\|_2)$. Suppose that $\|B\|_F^2/\|B\|_2^2 \ge \log m$. Suppose that $c'K^4 \le 1$ and

$$r(B) := \frac{\mathrm{tr}(B)}{\|B\|_2} \ge 16c'K^4\frac{f}{\log m}\log\frac{V m\log m}{f}, \tag{14}$$

where $V$ is a constant which depends on $\lambda_{\min}(A)$, $\rho_{\max}(s_0, A)$ and $\mathrm{tr}(B)/f$.
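Let $b_0, \phi$ be numbers which satisfy

$$\frac{M_\epsilon^2}{K^2 b_0^2} \le \phi \le 1. \tag{15}$$

Assume that the sparsity of $\beta^*$ satisfies, for some $0 < \phi \le 1$,

$$d := |\mathrm{supp}(\beta^*)| \le \frac{c'\phi K^4}{128 M_A^2}\frac{f}{\log m} < f/2. \tag{16}$$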


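Let $\hat\beta$ be an optimal solution to the Lasso-type estimator as in (6) with

$$\lambda \ge 4\psi\sqrt{\frac{\log m}{f}}, \quad \text{where } \psi := C_0 D_2 K\left(K\|\beta^*\|_2 + M_\epsilon\right). \tag{17}$$

Then for any $d$-sparse vectors $\beta^* \in \mathbb{R}^m$, such that $\phi b_0^2 \le \|\beta^*\|_2^2 \le b_0^2$, we have with probability at least $1 - 16/m^3$,

$$\|\hat\beta - \beta^*\|_2 \le \frac{20}{\alpha}\lambda\sqrt{d} \quad \text{and} \quad \|\hat\beta - \beta^*\|_1 \le \frac{80}{\alpha}\lambda d.$$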

We give an outline of the proof of Theorem 2 in Section 5.1. We prove Theorem 2 in Section 9.

Discussions. Denote the Signal-to-noise ratio by

$$S/N := K^2\|\beta^*\|_2^2/M_\epsilon^2, \quad \text{where } S := K^2\|\beta^*\|_2^2 \text{ and } N := M_\epsilon^2.$$

The two conditions on $b_0, \phi$ imply that $N \le \phi S$. Notice that this could be restrictive if $\phi$ is small.
We will show in Section 5.1 that condition (15) is not needed in order for the error bounds in terms of the $\ell_p$, $p = 1, 2$ norm of $\hat\beta - \beta^*$, as shown in the Theorem 2 statement, to hold. It was indeed introduced so as to simplify the expression for the condition on $d$ as shown in (16). There we provide a slightly more general condition on $d$ in (41), where (15) is not required. In summary, we prove that Theorem 2 holds with $N = M_\epsilon^2$ and $S = \phi K^2 b_0^2$ in arbitrary orders, so long as condition (14) holds and

$$d = o\left(\frac{f}{\log m}\right).$$

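For both cases, we require that $\lambda \asymp (\|A\|_2 + \|B\|_2) K\sqrt{S + N}\sqrt{\frac{\log m}{f}}$ as expressed in (17). That is, when either the noise level $M_\epsilon$ or the signal strength $K\|\beta^*\|_2$ increases, we need to increase $\lambda$ correspondingly; moreover, when $N$ dominates the signal $K^2\|\beta^*\|_2^2$, we have for $d \asymp \frac{1}{M_A^2}\frac{f}{\log m}$,

$$\frac{\|\hat\beta - \beta^*\|_2}{\|\beta^*\|_2} \le \frac{C}{\alpha}D_2K^2\sqrt{\frac{N}{S}}\frac{1}{M_A} \asymp D_2K^2\sqrt{\frac{N}{S}}\frac{1}{\varpi(s_0)},$$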


which eventually becomes a vacuous bound when $N \gg S$. We will present an improved bound in Theorem 4. We further elaborate on the relationships among the noise, the measurement error and the signal strength in Section 4.2.

Theorem 3. Suppose (A1) holds. Set $0 < \delta < 1$. Suppose that $f < m \ll \exp(f)$ and $1 \le d_0 < f$. Let $\lambda > 0$ be the same parameter as in (7). Assume that RE($2d_0, 3(1 + \lambda), A^{1/2}$) holds. Suppose that $\|B\|_F^2/\|B\|_2^2 \ge \log m$. Suppose that the sparsity of $\beta^*$ is bounded by

$$d_0 := |\mathrm{supp}(\beta^*)| \le c_0\sqrt{f/\log m} \tag{18}$$

for some constant $c_0 > 0$. Suppose that, for $k_0 := 1 + \lambda$,

$$f \ge \frac{2000\,dK^4}{\delta^2}\log\left(\frac{60em}{d\delta}\right), \quad \text{where} \tag{19}$$

$$d = 2d_0 + 2d_0 a_{\max}\frac{16K^2(2d_0, 3k_0, A^{1/2})(3k_0)^2(3k_0 + 1)}{\delta^2}. \tag{20}$$

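Consider the regression model in (1a) and (1b) with $X_0, W$ as in Definition 1.2 and an error vector $\epsilon \in \mathbb{R}^f$, independent of $X_0, W$, with independent entries $\epsilon_j$ satisfying $\mathbb{E}\epsilon_j = 0$ and $\|\epsilon_j\|_{\psi_2} \le M_\epsilon$. Let $\hat\beta$ be an optimal solution to the Conic Programming estimator as in (7) with input $(\hat\gamma, \hat\Gamma)$ as defined in (5), where $\widehat{\mathrm{tr}}(B)$ is as defined in (4). Choose, for $D_2 = 2(\|A\|_2 + \|B\|_2)$ and $D_0 = \sqrt{\tau_B} + \sqrt{a_{\max}}$,

$$\mu \asymp D_2K^2\sqrt{\frac{\log m}{f}} \quad \text{and} \quad \tau \asymp D_0KM_\epsilon\sqrt{\frac{\log m}{f}}.$$

Then with probability at least $1 - \frac{c'}{m^2} - 2\exp(-\delta^2 f/2000K^4)$, for $2 \ge q \ge 1$,

$$\|\hat\beta - \beta^*\|_q \le CD_2K^2d_0^{1/q}\sqrt{\frac{\log m}{f}}\left(\|\beta^*\|_2 + \frac{M_\epsilon}{K}\right). \tag{21}$$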
Under the same assumptions, the predictive risk admits the following bounds with the same probability as above,


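$$\frac{1}{f}\|X(\hat\beta - \beta^*)\|_2^2 \le C'D_2^2K^4d_0\frac{\log m}{f}\left(\|\beta^*\|_2 + \frac{M_\epsilon}{K}\right)^2,$$

where $c', C_0, C, C' > 0$ are some absolute constants.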

We give an outline of the proof of Theorem 3 in Section 5 while leaving the detailed proof in Section 10. 

Discussions. Similar results have been derived in Loh and Wainwright (2012); Belloni et al. (2014), however, under different assumptions on the distribution of the noise matrix $W$. When $W$ is a random matrix with i.i.d. subgaussian noise, our results will essentially recover the results in Loh and Wainwright (2012) and Belloni et al. (2014). The choice of $\lambda$ for the Lasso estimator and parameters $\mu, \tau$ for the DS-type estimator satisfy

$$\lambda \asymp \mu\|\beta^*\|_2 + \tau.$$

This relationship is made clear through Theorem 8 regarding the Lasso-type estimator, which follows from Theorem 1 of Loh and Wainwright (2012), and Lemmas 6, 11, 14, and 16, which are the key results in proving Theorems 2, 3, 4, and 5. Finally, we note that following Theorem 2 as in Belloni et al. (2014), one can show that without the relatively restrictive sparsity condition (18), a bound similar to that in (21) holds, however with $\|\beta^*\|_2$ being replaced by $\|\beta^*\|_1$, so long as the sample size satisfies the requirement as in (27).

4 Improved bounds when the measurement errors are small 

Throughout our analysis of Theorems 2 and 3, we focused on the case when the errors in $W$ are sufficiently large in the sense that $\tau_B = \mathrm{tr}(B)/f > 0$ is bounded from below; for example, this is explicitly indicated by the lower bound on the effective rank $r(B) = \mathrm{tr}(B)/\|B\|_2$, when $\|B\|_2$ is bounded away from 0. More precisely, by the condition on the effective rank as in (14), we have

$$\tau_B = \frac{\mathrm{tr}(B)}{f} \ge 16c'K^4\frac{\|B\|_2}{\log m}\log\frac{Vm\log m}{f}, \quad \text{where } V = 3eM_A^3/2.$$

The bounds we derive in this section focus on cases where the measurement errors in $W$ are small in their magnitudes in the sense of $\tau_B$ being small. For the extreme case when $\tau_B$ approaches 0, one hopes to recover a bound close to that of the regular Lasso or the Dantzig selector as the effect of the noise on the procedure should become negligible. We show in Theorems 4 and 5 that this is indeed the case. First, we define some constants which we use throughout the rest of the paper. Denote by

$$D_0 = \sqrt{\tau_B} + a_{\max}^{1/2}, \quad D_0' = \|B\|_2^{1/2} + a_{\max}^{1/2}, \quad \text{and} \quad \tau_B^+ := (\tau_B^{+/2})^2, \tag{22}$$

$$\text{where} \quad \tau_B^{+/2} := \sqrt{\tau_B} + \frac{D_{\mathrm{oracle}}}{\sqrt{m}} \quad \text{and} \quad D_{\mathrm{oracle}} = 2(\|A\|_2^{1/2} + \|B\|_2^{1/2}). \tag{23}$$

We first state a more refined result for the Lasso-type estimator. 
Theorem 4. Suppose all conditions in Theorem 2 hold, except that we drop (15) and replace (17) with

$$\lambda \ge 4\psi\sqrt{\frac{\log m}{f}}, \quad \text{where } \psi := 2C_0D_0'K\left(K\tau_B^{+/2}\|\beta^*\|_2 + M_\epsilon\right). \tag{24}$$


Suppose that for 0 < f < 1 and Ca '■= 


d := |supp(;8*)| < Ca 


f 


logm 


,r-, 11 11 2 “I" '^max 7-1 ^ r-i 

W •= - for = 


{c'C' 0 A 2 } where 






(25) 


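$\varpi(s_0) = \rho_{\max}(s_0, A) + \tau_B$, and $c', \phi, b_0, M_\epsilon$ and $K$ as defined in Theorem 2.

Then for any $d$-sparse vectors $\beta^* \in \mathbb{R}^m$, such that $\phi b_0^2 \le \|\beta^*\|_2^2 \le b_0^2$, we have with probability at least $1 - 16/m^3$,

$$\|\hat\beta - \beta^*\|_2 \le \frac{20}{\alpha}\lambda\sqrt{d} \quad \text{and} \quad \|\hat\beta - \beta^*\|_1 \le \frac{80}{\alpha}\lambda d.$$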

We give an outline for the proof of Theorem 4 in Section 5.3, and show the actual proof in Section 11. 
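Remark 4.1. Let us redefine the signal-to-noise ratio by

$$S/M := \frac{K^2\|\beta^*\|_2^2}{\tau_B^{+}K^2\|\beta^*\|_2^2 + M_\epsilon^2}, \quad \text{where} \quad S := K^2\|\beta^*\|_2^2 \quad \text{and} \quad M := M_\epsilon^2 + \tau_B^{+}K^2\|\beta^*\|_2^2.$$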

We now only require that $\lambda \asymp (a_{\max} + \|B\|_2)^{1/2}K\sqrt{M\log m/f}$. That is, when either the noise level $M_\epsilon$ or the
measurement error strength in terms of $\tau_B^{+/2}\|\beta^*\|_2$ increases, we need to increase the penalty parameter

A correspondingly; moreover, when d x lolm 




< —DqK^ 
a 


SMa^ ° V Szz7(so)’ 


which eventually becomes a vacuous bound when S. 


4.1 A Corollary for Theorem 3 

We next state in Theorem 5 an improved bound for the Conic programming estimator (7), which improves
upon Theorem 3 when $\tau_B$ is small.

Theorem 5. Suppose all conditions in Theorem 3 hold, except that we replace the condition on d as in (18)
with the following. Suppose that the sample size f and the size of the support of $\beta^*$ satisfy the following
requirements: for $C_6 \ge D_{\mathrm{oracle}}$ and $r_{m,m} = 2C_0\sqrt{\frac{\log m}{mf}}$,

$$d_0 = O\Big(\tau_B^{-}\sqrt{\frac{f}{\log m}}\Big), \quad \text{where} \quad \tau_B^{-} \le \frac{1}{\tau_B^{1/2} + 2C_6Kr_{m,m}^{1/2}},$$ (26)

$$\text{and} \quad f \ge \frac{2000\,dK^4}{\delta^2}\log\Big(\frac{60em}{d\delta}\Big), \quad \text{where}$$ (27)

$$d = 2d_0 + 2d_0a_{\max}\frac{16K^2(2d_0, 3k_0, A^{1/2})(3k_0)^2(3k_0+1)}{\delta^2}.$$ (28)


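Let $\hat\beta$ be an optimal solution to the Conic Programming estimator as in (7) with input $(\hat\gamma, \hat\Gamma)$ as defined
in (5), where $\widehat{\mathrm{tr}}(B)$ is as defined in (4). Suppose

$$\tau \asymp D_0M_\epsilon r_{m,f}, \quad \text{where} \quad r_{m,f} = C_0K\sqrt{\frac{\log m}{f}},$$

and

$$\mu \asymp D_0'\tilde\tau_B^{1/2}Kr_{m,f}, \quad \text{where} \quad \tilde\tau_B^{1/2} := \hat\tau_B^{1/2} + C_6Kr_{m,m}^{1/2}.$$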


Then with probability at least 1 — ^ — 2 exp (—<52 J/2000iir^), for 2 > q > 1, and 
\CoKt\L%) 




< C'D'oK^dl^'^ 

<? 




(29) 

(30) 


(31) 


Under the same assumptions, the predictive risk admits the following bounds 


i 

f 


Xifi-fi*) 


<C"{\\B\\^ + a^^,)KW 



with the same probability as above, where cf, C, C" > 0 are some absolute constants, and x 2tb + 

0/^2 t^2 
oUg i\ 1 7 n,m* 


4.2 Discussions 

In particular, when $\tau_B \to 0$, Theorem 5 allows us to recover a rate close to that of the Dantzig selector, with
exact recovery if $\tau_B = 0$ is known a priori; see Section 13. Moreover, the constraint (18) on the sparsity
parameter $d_0$ appearing in Theorem 3 can now be relaxed as in (26). Roughly speaking, one can think of $d_0$
being bounded as follows for the Conic programming estimator (7):



That is, when $\tau_B$ decreases, we allow larger values of $d_0$; however, when $\tau_B \to 0$, the sparsity level
of $d = O(f/\log(m/d))$ starts to dominate, which enables the Conic Programming estimator to achieve
results similar to the Dantzig Selector when the design matrix $X_0$ is a subgaussian random matrix satisfying
the Restricted Eigenvalue conditions; see for example Candes and Tao (2007); Bickel et al. (2009);
Rudelson and Zhou (2013).

The condition on d for the Lasso estimator as defined in (25) suggests that as $\tau_B \to 0$, and thus
—)• 0 the requirement on the sparsity parameter d becomes slightly more stringent when x 1 

and much more restrictive when = o(l); however, suppose we require 


that is, the stochastic error $\epsilon$ in the response variable y as in (1a) does not converge to 0 as quickly as the
measurement error W in (1b) does, then the sparsity constraint becomes essentially unchanged as $\tau_B \to 0$.
In this case, essentially, we require that for some c" := 


d<C'A 


f f 


logm 




A 1 


given that 


rtK^ 


where x 
K'^M^ 


and := 

% 


64Mj’ 




< 


bi 


These tradeoffs are somewhat different from the behavior of the Conic programming estimator (cf. (32)).


5 Proof of theorems 


We first consider the following large deviation bound on $\|\hat\gamma - \hat\Gamma\beta^*\|_\infty$ as stated in Lemma 6. This entity
appears in the constraint set in the conic programming estimator (7), and is directly related to the choice of
$\lambda$ for the lasso-type estimator in view of Theorem 8. Events $\mathcal{B}_0$ and $\mathcal{B}_{10}$ are defined in Section B.2 in the
Appendix.

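Lemma 6. Suppose (A1) holds. Let $X = X_0 + W$, where $X_0, W$ are as defined in Theorem 2. Suppose that

$$\|B\|_F^2/\|B\|_2^2 \ge \log m, \quad \text{where } m \ge 16.$$

Let $\hat\Gamma$ and $\hat\gamma$ be as in (5). On event $\mathcal{B}_0$, we have for $D_2 = 2(\|A\|_2 + \|B\|_2)$ and some absolute constant $C_0$,

$$\|\hat\gamma - \hat\Gamma\beta^*\|_\infty \le \psi\sqrt{\frac{\log m}{f}}, \quad \text{where} \quad \psi = C_0D_2K\big(K\|\beta^*\|_2 + M_\epsilon\big)$$

is as defined in Theorem 2. Then $P(\mathcal{B}_0) \ge 1 - 16/m^3$.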

Lemma 7. Let m > 2. Let X be defined as in Definition 1.2 and tb be as defined in (4). Denote by 
Tb = tr(i7)// and ta = tr(A)/m. Suppose that f V {r{A)r{B)) > logm. Denote by Bq the event such 
that 


2CoK‘^^ 

logm 

mf 

V s/m 

IBIIri 

Vf J 

= : DiK\, 

y/m 

V 7 

Gfld 

= 2Co^ 

/logm 
/ mf 


where Di := 


Then P {Bq) > 1 — If we replace y/fogm with log m in the definition of event Bq, then we can drop the 
condition on f or r{A)r{B) = to achieve the same bound on event Bq. 
We prove Lemma 7 in Section B.3 in the Appendix. We prove Lemma 6 in Section C.1. We mention in
passing that Lemma 6 is essential in proving Theorem 3 as well.

We state variations on this inequality in Lemma 14 and the remark which immediately follows. 

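Theorem 8. Consider the regression model in (1a) and (1b). Let $d \le f/2$. Let $\hat\gamma, \hat\Gamma$ be as constructed in
(5). Suppose that the matrix $\hat\Gamma$ satisfies the Lower-RE condition with curvature $\alpha > 0$ and tolerance $\tau > 0$,

$$\sqrt{d}\,\tau \le \min\Big\{\frac{\alpha}{32\sqrt{d}}, \frac{\lambda}{4b_0}\Big\},$$ (33)

where d, $b_0$ and $\lambda$ are as defined in (6). Then for any d-sparse vector $\beta^* \in \mathbb{R}^m$ such that

$$\|\beta^*\|_2 \le b_0 \quad \text{and} \quad \|\hat\gamma - \hat\Gamma\beta^*\|_\infty \le \frac{1}{2}\lambda,$$ (34)

the following bounds hold:

$$\|\hat\beta - \beta^*\|_2 \le \frac{20}{\alpha}\lambda\sqrt{d} \quad \text{and} \quad \|\hat\beta - \beta^*\|_1 \le \frac{80}{\alpha}\lambda d,$$ (35)

where $\hat\beta$ is an optimal solution to the Lasso-type estimator as in (6).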

We defer the proof of Theorem 8 to Section D, for clarity of presentation. In Section 5.1, we provide
Lemmas 9 and 10 for checking the RE conditions as well as condition (33). One can then combine
Theorem 8 with Lemmas 6, 9 and 10 to prove Theorem 2. In more detail, Lemma 9 checks the Lower and the
Upper RE conditions on the modified gram matrix

$$\hat\Gamma_A := \frac{1}{f}\big(X^TX - \widehat{\mathrm{tr}}(B)I_m\big),$$ (36)

while Lemma 10 checks condition (33) as stated in Theorem 8 for curvature $\alpha$ and tolerance $\tau$ as derived in
Lemma 9. Finally, Lemma 6 ensures that (34) holds with high probability for $\lambda$ chosen as in (17). We defer
stating these lemmas to Section 5.1. The full proof of Theorem 2 appears in Section 9.
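To make the construction in (36) concrete, the following minimal sketch forms the corrected gram matrix together with the vector $\frac{1}{f}X^Ty$ from plain Python lists. The function name `corrected_gram` and the argument `tr_B_hat` are hypothetical. The sketch assumes that an estimate of tr(B), for instance the one in (4), has already been computed.

```python
def corrected_gram(X, y, tr_B_hat):
    """Return (Gamma_hat, gamma_hat) for an f x m design X given as a list of rows.

    Gamma_hat = (X^T X - tr_B_hat * I_m) / f and gamma_hat = X^T y / f.
    tr_B_hat is an externally supplied estimate of tr(B) (assumption of this sketch).
    """
    f, m = len(X), len(X[0])
    # Entry (j, k) of X^T X / f.
    gamma_mat = [[sum(X[i][j] * X[i][k] for i in range(f)) / f
                  for k in range(m)] for j in range(m)]
    # Subtract the diagonal correction tr_B_hat / f.
    for j in range(m):
        gamma_mat[j][j] -= tr_B_hat / f
    # Entry j of X^T y / f.
    gamma_vec = [sum(X[i][j] * y[i] for i in range(f)) / f for j in range(m)]
    return gamma_mat, gamma_vec


# Toy example with f = m = 2.
Gamma_hat, gamma_hat = corrected_gram([[1.0, 0.0], [0.0, 2.0]], [1.0, 1.0], 0.5)
```

Only the diagonal of the gram matrix is modified, so the correction costs O(m) on top of forming $X^TX$.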

For Theorem 3, our first goal is to show that the following holds with high probability:

$$\Big\|\frac{1}{f}X^T(y - X\beta^*) + \frac{1}{f}\widehat{\mathrm{tr}}(B)\beta^*\Big\|_\infty \le \mu\|\beta^*\|_2 + \tau,$$

where $\mu, \tau$ are as chosen in (43). This forms the basis for proving the $\ell_q$ convergence, where $q \in [1,2]$,
for the Conic Programming estimator (7). This follows immediately from Lemma 6. More explicitly, we
will state it in Lemma 11. Before we proceed, we first need to introduce some notation and definitions. Let
$X_0 = Z_1A^{1/2}$ be defined as in Definition 1.2. Let $k_0 = 1 + \lambda$. First we need to define the $\ell_q$-sensitivity
parameter for $\Psi := \frac{1}{f}X_0^TX_0$ following Belloni et al. (2014):


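$$\kappa_q(d_0, k_0) = \min_{J : |J| \le d_0}\ \min_{\Delta \in \mathrm{Cone}_J(k_0)} \frac{\|\Psi\Delta\|_\infty}{\|\Delta\|_q}, \quad \text{where}$$ (37)

$$\mathrm{Cone}_J(k_0) = \{x \in \mathbb{R}^m \ | \ \text{s.t.}\ \|x_{J^c}\|_1 \le k_0\|x_J\|_1\}.$$ (38)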
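The cone constraint in (38) is a simple comparison of $\ell_1$ mass on and off the index set J. A minimal sketch of the membership test follows; the function name `in_cone` and the 0-based indexing of J are assumptions of this illustration, not notation from the paper.

```python
def in_cone(x, J, k0):
    """Check whether x lies in Cone_J(k0): ||x_{J^c}||_1 <= k0 * ||x_J||_1.

    x is a list of floats, J an iterable of 0-based indices, k0 > 0.
    """
    J = set(J)
    l1_on = sum(abs(v) for i, v in enumerate(x) if i in J)
    l1_off = sum(abs(v) for i, v in enumerate(x) if i not in J)
    return l1_off <= k0 * l1_on


# Most of the mass sits on J = {0}, so the vector is in the cone for k0 = 1.
concentrated = in_cone([3.0, 0.5, 0.5], J=[0], k0=1.0)
# Here the off-support mass (4) exceeds k0 times the on-support mass (3).
spread_out = in_cone([1.0, 2.0, 2.0], J=[0], k0=3.0)
```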


See also Gautier and Tsybakov (2011). Let $(\hat\beta, \hat t)$ be the optimal solution to (7) and denote by $v = \hat\beta - \beta^*$.
We will state the following auxiliary lemmas, the first of which is deterministic in nature. The two lemmas
reflect the two geometrical constraints on the optimal solution to (7). The optimal solution $\hat\beta$ satisfies:

1. v obeys the following cone constraint: $\|v_{S^c}\|_1 \le k_0\|v_S\|_1$ and $\hat t \le \frac{1}{\lambda}\|v\|_1 + \|\beta^*\|_2$.

2. $\|\Psi v\|_\infty$ is upper bounded by a quantity at the order of $O\big(\mu(\|\beta^*\|_2 + \|v\|_1) + \tau\big)$.

Now combining Lemma 6 of Belloni et al. (2014) and an earlier result of the two authors (cf. Theorem 25,
Rudelson and Zhou (2013)), we can show that the RE$(2d_0, 3(1+\lambda), A^{1/2})$ condition and the sample
requirement as in (27) are enough to ensure that the $\ell_q$-sensitivity parameter satisfies the following lower
bound for all $1 \le q \le 2$: for some constant c,

$$\kappa_q(d_0, k_0) \ge c\,d_0^{-1/q}, \quad \text{which ensures that for } v = \hat\beta - \beta^*,$$

$$\|\Psi v\|_\infty \ge \kappa_q(d_0, k_0)\|v\|_q \ge c\,d_0^{-1/q}\|v\|_q, \quad \text{where} \quad \Psi = \frac{1}{f}X_0^TX_0.$$ (39)

Combining (39) with Lemmas 11, 12 and 13 gives us both the lower and upper bounds on $\|\Psi v\|_\infty$, with
the lower bound being $\kappa_q(d_0, k_0)\|v\|_q$ and the upper bound as specified in Lemma 13. Following some
algebraic manipulation, this yields the bound on $\|v\|_q$ for all $1 \le q \le 2$. We state Lemmas 11 to 13 in
Section 5.2 while leaving the proof for Theorem 3 to Section 10.


5.1 Additional technical results for Theorem 2 

The main focus of the current section is to apply Theorem 8 to show Theorem 2, which applies to the general 
subgaussian model as considered in the present work. We first state Lemma 9, which follows immediately 
from Corollary 19. First, we replace (A3) with (A3’) which reveals some additional information regarding 
the constant hidden inside the O(-) notation. 

(A3’) Suppose (A3) holds; moreover, for D 2 = 2(||A||2 -|- ||77||2)> logm/A^;^(A) or 

equivalently. 


Amin(A) 

^Il2 + 11-^ 



for some large enough contant Ck- 


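Lemma 9. (Lower and Upper-RE conditions) Suppose (A1), (A2) and (A3') hold. Denote by $V := 3eM_A^3/2$,
where $M_A$ is as defined in (13). Let $s_0$ be as defined in (12). Suppose that for some $c' > 0$,

$$\frac{\mathrm{tr}(B)}{\|B\|_2} \ge c'K^4\frac{s_0}{\varepsilon^2}\log\Big(\frac{3em}{s_0\varepsilon}\Big), \quad \text{where} \quad \varepsilon = \frac{1}{2M_A}.$$ (40)

Let $\mathcal{A}_0$ be the event that the modified gram matrix $\hat\Gamma_A$ as defined in (36) satisfies the Lower as well as Upper
RE conditions with

$$\text{curvature} \quad \alpha = \frac{1}{2}\lambda_{\min}(A), \quad \text{smoothness} \quad \bar\alpha = 3\lambda_{\max}(A)/2,$$

$$\text{and tolerance} \quad \frac{512C^2\varpi(s_0)^2}{\lambda_{\min}(A)}\frac{\log m}{f} \le \tau := \frac{\alpha}{s_0} \le \frac{1024C^2\varpi(s_0+1)^2}{\lambda_{\min}(A)}\frac{\log m}{f},$$

for $\alpha$, $\bar\alpha$ and $\tau$ as defined in Definitions 2.2 and 2.3, and $C, s_0, \varpi(s_0)$ in (12).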


Then P (Aq) > 1 — 
Lemma 10. Suppose all conditions in Lemma 9 hold. Suppose that $s_0 \ge 3$ and

$$d := |\mathrm{supp}(\beta^*)| \le C_A\frac{f}{\log m}\{c'D_\phi \wedge 2\}, \quad \text{where} \quad C_A := \frac{1}{128M_A^2}, \quad D_\phi = \frac{K^2M_\epsilon^2}{b_0^2} + K^4\phi,$$ (41)

where $c', \phi, b_0, M_\epsilon$ and K are as defined in Theorem 2, where we assume that $\|\beta^*\|_2^2 \ge \phi b_0^2$ for some
$0 < \phi \le 1$. Then the following condition holds



(42) 


where $\psi$ is as defined in (17) and $\alpha = \lambda_{\min}(A)/2$.

We prove Lemmas 9 and 10 in Sections D.1 and D.2 respectively.

Remark 5.1. Clearly for $d, b_0, \phi$ as bounded in Theorem 2, we have by assumption (15) the following upper
and lower bound on $D_\phi$:

$$2K^4 \ge D_\phi := \frac{K^2M_\epsilon^2}{b_0^2} + K^4\phi \ge K^4\phi.$$

In this regime, the conditions on d as in (41) can be conveniently expressed as in (16).


5.2 Technical lemmas for Theorem 3 


We state the technical lemmas needed for proving Theorem 3. The proof for Lemma 12 follows directly 
from that in Belloni et al. (2014) in view of Lemma 11. 

Lemma 11. Suppose all conditions in Lemma 6 hold. Then on event $\mathcal{B}_0$ as defined therein, the pair
$(\beta, t) = (\beta^*, \|\beta^*\|_2)$ belongs to the feasible set of the minimization problem (7) with $r_{m,f} := C_0K\sqrt{\frac{\log m}{f}}$,

$$\mu \asymp 2D_2Kr_{m,f} \quad \text{and} \quad \tau \asymp D_0M_\epsilon r_{m,f},$$ (43)

where $D_0 = \sqrt{\tau_B} + \sqrt{a_{\max}}$ and $D_2 = 2(\|A\|_2 + \|B\|_2)$ as in Theorem 3.

Lemma 12. Let $\mu, \tau > 0$ be set. Suppose that the pair $(\beta, t) = (\beta^*, \|\beta^*\|_2)$ belongs to the feasible set of
the minimization problem (7), for which $(\hat\beta, \hat t)$ is an optimal solution. Denote by $v = \hat\beta - \beta^*$. Then

$$\|v_{S^c}\|_1 \le (1 + \lambda)\|v_S\|_1 \quad \text{and} \quad \hat t \le \frac{1}{\lambda}\|v\|_1 + \|\beta^*\|_2.$$

Lemma 13. On event $\mathcal{B}_0 \cap \mathcal{B}_{10}$,

$$\|\Psi v\|_\infty \le \mu_1\|\beta^*\|_2 + \mu_2\|v\|_1 + \tau',$$

where $\mu_1 = 2\mu$, $\mu_2 = \mu(\frac{1}{\lambda} + 1)$ and $\tau' = 2\tau$ for $\mu, \tau$ as defined in (43).

We prove Lemmas 11, 12 and 13 in Section E.1.


5.3 Improved bounds for the Lasso-type estimator


We give an outline illustrating where the improvement for the lasso error bounds as stated in Theorem 4 comes
from. We emphasize the impact of this improvement over the sparsity parameter $d_0$. The proof for Theorem 4
follows exactly the same line of arguments as in Theorem 2, except that we now use the improved bound
on the error term $\|\hat\gamma - \hat\Gamma\beta^*\|_\infty$ given in Lemma 14 instead of that in Lemma 6, which is used in proving
Theorems 2 and 3. See Section 11 for details, as well as the proof for Theorem 4 and the following two
lemmas.

Lemma 14. Suppose all conditions in Lemma 6 hold. Let $D_0$, $D_0'$, $D_{\mathrm{oracle}}$ and $\tau_B^{+/2} := \sqrt{\tau_B} + D_{\mathrm{oracle}}/\sqrt{m}$ be
as defined in (22) and (23). On event $\mathcal{B}_0$,

$$\|\hat\gamma - \hat\Gamma\beta^*\|_\infty \le \psi\sqrt{\frac{\log m}{f}},$$ (44)

where $\psi := C_0K\big(D_0'K\tau_B^{+/2}\|\beta^*\|_2 + D_0M_\epsilon\big)$. Then $P(\mathcal{B}_0) \ge 1 - 16/m^3$.

Moreover, we replace Lemma 10 with Lemma 15, the proof of which follows from Lemma 10 with d now
being bounded as in (25) and $\psi$ being redefined as immediately above in (44).

Lemma 15. Suppose all conditions in Lemma 9 hold. Suppose that (25) holds. Then (42) holds with $\psi$ as
defined in Theorem 4 and $\alpha = \lambda_{\min}(A)/2$.


5.4 Improved bounds for the DS-type estimator 


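An “oracle” rate for the Conic programming estimator (7) is defined as follows. Recall the following
notation: $r_{m,f} = C_0K\sqrt{\frac{\log m}{f}}$. The trick is that we assume that we know the noise level in W by knowing
$\tau_B := \mathrm{tr}(B)/f$; then we can set

$$\mu \asymp D_0'\big(\tau_B^{1/2} + D_{\mathrm{oracle}}/\sqrt{m}\big)Kr_{m,f} \quad \text{while retaining} \quad \tau \asymp D_0M_\epsilon r_{m,f}$$

in view of the improved error bounds over $\|\hat\gamma - \hat\Gamma\beta^*\|_\infty$ as given in Lemma 14. Without knowing this
parameter, we could rely on the estimate from $\hat\tau_B$ as in (4), which is what we do next. For a chosen
parameter $C_6$, we use $\tilde\tau_B^{1/2} := \hat\tau_B^{1/2} + C_6Kr_{m,m}^{1/2}$ to replace $\tau_B^{+/2} := \tau_B^{1/2} + D_{\mathrm{oracle}}/\sqrt{m}$ and set

$$\mu \asymp C_0D_0'\big(\hat\tau_B^{1/2} + C_6Kr_{m,m}^{1/2}\big)K^2\sqrt{\frac{\log m}{f}}, \quad \text{where } C_6 \ge D_{\mathrm{oracle}},$$

$$r_{m,m} = 2C_0\sqrt{\frac{\log m}{mf}} \quad \text{and} \quad D_{\mathrm{oracle}} := 2(\|A\|_2^{1/2} + \|B\|_2^{1/2}).$$

Notice that we know neither $D_0'$ nor $D_{\mathrm{oracle}}$, where recall $D_0' = \|B\|_2^{1/2} + a_{\max}^{1/2}$. However, assuming that
we normalize the column norms of the design matrix X to be roughly at the same scale, we have

$$D_0' \asymp 1 \quad \text{while} \quad D_{\mathrm{oracle}}/\sqrt{m} = o(1) \quad \text{in case } \|A\|_2, \|B\|_2 \le M$$

for some large enough constant M. This is crucial in deriving, and putting in perspective, the faster rates of convergence in
estimating $\beta^*$ and in the predictive error $\|Xv\|_2^2$ when $\tau_B = o(1)$, in view of Lemmas 16 and 18.
Lemma 16 follows directly from Lemma 14.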
Lemma 16. Suppose all conditions in Lemma 14 hold. Let Dq = + -y/Omax) 1 under (A1). Then

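on event $\mathcal{B}_0$, the pair $(\beta, t) = (\beta^*, \|\beta^*\|_2)$ belongs to the feasible set of the minimization problem (7) with

$$\mu \ge D_0'\tau_B^{+/2}Kr_{m,f} \quad \text{and} \quad \tau \ge D_0M_\epsilon r_{m,f},$$ (45)

where $\tau_B^{+/2} := \tau_B^{1/2} + D_{\mathrm{oracle}}/\sqrt{m}$ is as defined in (23).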

-—1 /2 

Lemma 17. On event Bq and (Al), the choice of as in (30) satisfies for m > 16 and Cq > 1, 
+/2 ~l/2 1/2 3 t/2 

tb < 2rB + SCgK'^rmm ^ <ind moreover < 1 


(46) 

(47) 


We next state an updated result in Lemma 18. 

Lemma 18. On event Bq H Biq, the solution fi to (7) with p, r as in (30) and (29), satisfies for v := fi — fi* 


jX^Xov 


< bl 11/3*112 + b2 ||u||i +t' 

cx> 


where pi = 2p, p 2 = 2/t(l + and t' = 2t. 


6 Lower and Upper RE conditions 

The goal of this section is to show that for $\Delta$ defined in (51), the presumption in Lemmas 32 and 34 as
restated in (48) holds with high probability (cf. Theorem 20). We first state a deterministic result showing
that the Lower and Upper RE conditions hold for $\hat\Gamma_A$ under condition (48) in Corollary 19. This allows
us to prove Lemma 9 in Section D.1. See Sections G and H, where we show that Corollary 19 follows
immediately from the geometric analysis result as stated in Lemma 34.

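Corollary 19. Let $1/8 > \delta > 0$. Let $1 \le \zeta < m/2$. Let $A_{m \times m}$ be a symmetric positive semidefinite
covariance matrix. Let $\hat\Gamma_A$ be an $m \times m$ symmetric matrix and $\Delta = \hat\Gamma_A - A$. Let $E = \cup_{|J| \le \zeta}E_J$, where
$E_J = \mathrm{span}\{e_j : j \in J\}$. Suppose that $\forall u, v \in E \cap S^{m-1}$,

$$|u^T\Delta v| \le \delta \le \frac{1}{8}\lambda_{\min}(A).$$ (48)

Then the Lower and Upper RE conditions hold: for all $v \in \mathbb{R}^m$,

$$v^T\hat\Gamma_Av \ge \frac{1}{2}\lambda_{\min}(A)\|v\|_2^2 - \frac{\lambda_{\min}(A)}{2\zeta}\|v\|_1^2,$$ (49)

$$v^T\hat\Gamma_Av \le \frac{3}{2}\lambda_{\max}(A)\|v\|_2^2 + \frac{\lambda_{\min}(A)}{2\zeta}\|v\|_1^2.$$ (50)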

Theorem 20. Let $A_{m \times m}$, $B_{f \times f}$ be symmetric positive definite covariance matrices. Let $E = \cup_{|J| \le \zeta}E_J$ for
$1 \le \zeta < m/2$. Let Z, X be $f \times m$ random matrices defined as in Theorem 2. Let $\hat\tau_B$ be defined as in (4).
Let

$$\Delta := \hat\Gamma_A - A := \frac{1}{f}X^TX - \hat\tau_BI_m - A.$$ (51)

Suppose that for some absolute constant $c' > 0$ and $0 < \varepsilon \le \frac{1}{C}$,


ti{B) 


\B\ 


c 

> ( log ) V 


m 


(52) 


where C = Cq! sfd for Cq as chosen to satisfy (86). 

Then with probability at least 1 — 4exp C2£^ j^*4||B|| ) “ 2exp C 2 e^-^^ — Qlmf, where C 2 > 2, we 
have for all u,v £ E Cl S'^~^ and tu(() = rs + Pmax(C> ^). tmd Di < 


y/m 


+ 


V7 ’ 


\u^Av\ < 8Cw{C)e + ACqDiK'^ 


llogm 

mf 


We prove Theorem 20 in Section I. As a corollary of Theorem 20, we will state Corollary 23 in Section 7. 


7 Concentration bounds for error-corrected gram matrices 


In this section, we show an upper bound on the operator norm convergence as well as an isometry property
for estimating B using the corrected gram matrix $\hat B := \frac{1}{m}\big(XX^T - \mathrm{tr}(A)I_f\big)$. Theorem 21 and Corollary 22
state that for the matrix $B \succ 0$ with the smaller dimension, $\hat B$ tends to stay positive definite after this error
correction step with overwhelming probability, where we rely on f being dominated by the effective rank
of the positive definite matrix A. When we subtract a diagonal matrix from the gram matrix $\frac{1}{f}X^TX$
to form an estimator $\hat A$, we clearly introduce a large number of negative eigenvalues when $f \ll m$. This in
general is a bad idea. However, the sparse eigenvalues for $\hat A$ can stay pretty close to those of A, as we will
show in Corollary 23.
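The remark about negative eigenvalues can be seen in the smallest possible example. When $f < m$ there is a nonzero v with $Xv = 0$, and along that direction the corrected quadratic form equals minus the subtracted diagonal level times $\|v\|_2^2$; along a sparse direction the form can remain positive. The sketch below is only an illustration: the function name `corrected_quadratic_form`, the toy matrix, and the value 0.1 for the diagonal level are all hypothetical.

```python
def corrected_quadratic_form(X, v, tau_hat):
    """Return v^T ((1/f) X^T X - tau_hat * I_m) v for X given as a list of f rows."""
    f = len(X)
    # X v, computed row by row.
    Xv = [sum(x_ij * v_j for x_ij, v_j in zip(row, v)) for row in X]
    return sum(t * t for t in Xv) / f - tau_hat * sum(v_j * v_j for v_j in v)


# One observation (f = 1) of m = 2 covariates, so X^T X has rank one.
X = [[1.0, 1.0]]

# v = (1, -1) is in the null space of X: the corrected form is negative.
dense_direction = corrected_quadratic_form(X, [1.0, -1.0], tau_hat=0.1)

# A 1-sparse direction still sees a positive quadratic form.
sparse_direction = corrected_quadratic_form(X, [1.0, 0.0], tau_hat=0.1)
```

This is the sense in which the corrected matrix fails to be positive semidefinite globally while its restriction to sparse vectors can still be well behaved.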

Theorem 21. Let £ > 0. Let X be defined as in Definition 1.2. Suppose that for some c' > 0 and 

0 < £ < 1 / 2 , 


^ ^,^,^4log(3/£) 

Mil - 

Then with probability at least 1 — 2 exp (—c£^-^) — 4 exp 

< C2£ {ta + II.BII 2 ) 

m 2 

where C 2 ,c^ are absolute constants depending on cf,C, where C > 4max(^, 
constant. 

Corollary 22. Suppose all conditions in Theorem 21 hold. Suppose 


(53) 


IXX^ - _ B 

m 


1 


cc' ’ Cici? • 


is a large enough 


(54) 


where C 3 = C 2 ^ ^■\b) ^ in Theorem 21. Then with the probability as stated in Theorem 21, 


{l + 26)B >~ 


XX^ tr(A)/y 


m m 

where for the last inequality to hold, we assume that Xmin{B) > 0 . 


^ (1 - 26)B y 0 
Next we show a large deviation bound on the sparse eigenvalues of the error-corrected matrix $\hat A$.

Corollary 23. Let X be defined as in Definition 1.2. Let $\hat A := \frac{1}{f}X^TX - \tau_BI_m$. Suppose

^ /; j_^4log(3em/A:e) 

jBg - • 

Then with probability at least 1 — 2 exp(—C 4 e^-^) — 4exp(—C 4 e^i^^j|^|p), 

Praa.x{k,A) < Prai,x{k, A){1 + lOe) + CaSTb 
where C 4 is an absolute constant. Moreover, suppose for V 1 

\ PminV^s-^/ 


IBII 2 “ (^2 


(55) 


(56) 


Then with the probability as stated immediately above, we have 

Prmn{k,A^ > j 4)(1 2 ( 5 ). 

We prove Theorem 21 in Section J. We also prove the concentration of measure bounds on error-corrected 
gram matrices in Corollaries 22 and 23 in Sections J. 1 and J.2 respectively. 


8 Proof of Lemma 1 

We define $\mathrm{Cone}(d_0, k_0)$, where $0 < d_0 < m$ and $k_0$ is a positive number, as the set of vectors in $\mathbb{R}^m$ which
satisfy the following cone constraint:

$$\mathrm{Cone}(d_0, k_0) = \{x \in \mathbb{R}^m \ | \ \exists I \subset \{1, \ldots, p\},\ |I| = d_0 \ \text{s.t.}\ \|x_{I^c}\|_1 \le k_0\|x_I\|_1\}.$$

For each vector $x \in \mathbb{R}^m$, let $T_0$ denote the locations of the $s_0$ largest coefficients of x in absolute values. The
following elementary estimate from Rudelson and Zhou (2013) will be used in conjunction with the RE condition.

Lemma 24. For each vector $x \in \mathrm{Cone}(s_0, k_0)$, let $T_0$ denote the locations of the $s_0$ largest coefficients of
x in absolute values. Then

$$\|x_{T_0}\|_2 \ge \frac{\|x\|_2}{\sqrt{1 + k_0}}.$$ (57)
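Lemma 24 is easy to probe numerically. The sketch below tests the cone constraint on the $s_0$ largest coordinates (which is the most favorable index set of that size) and then checks the inequality (57). The function name `check_lemma24`, the dimensions, the seed, and the small floating-point tolerance are choices made for this illustration only.

```python
import math
import random


def check_lemma24(x, s0, k0):
    """Return (x in Cone(s0, k0), whether ||x_{T0}||_2 >= ||x||_2 / sqrt(1 + k0)).

    T0 holds the indices of the s0 largest entries of x in absolute value.
    Membership in Cone(s0, k0) is tested on T0, since T0 maximizes ||x_I||_1
    over all index sets I of size s0.
    """
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    T0, rest = order[:s0], order[s0:]
    l1_on = sum(abs(x[i]) for i in T0)
    l1_off = sum(abs(x[i]) for i in rest)
    in_cone = l1_off <= k0 * l1_on
    l2_T0 = math.sqrt(sum(x[i] ** 2 for i in T0))
    l2_all = math.sqrt(sum(v * v for v in x))
    # Tiny tolerance guards against floating-point rounding only.
    bound_holds = l2_T0 >= l2_all / math.sqrt(1.0 + k0) - 1e-12
    return in_cone, bound_holds


random.seed(0)
results = [check_lemma24([random.gauss(0.0, 1.0) for _ in range(10)], s0=3, k0=2.0)
           for _ in range(1000)]
# Every sampled vector that lies in the cone should satisfy the bound.
violations = sum(1 for in_cone, ok in results if in_cone and not ok)
```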


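Proof of Lemma 1. Part I: Suppose that the Lower-RE condition holds for $\Gamma := A^TA$. Let $x \in
\mathrm{Cone}(s_0, k_0)$. Then

$$\|x\|_1 \le (1 + k_0)\|x_{T_0}\|_1 \le (1 + k_0)\sqrt{s_0}\|x_{T_0}\|_2.$$

Thus for $x \in \mathrm{Cone}(s_0, k_0) \cap S^{p-1}$ and $\tau(1 + k_0)^2s_0 \le \alpha/2$, we have

$$\|Ax\|_2 = (x^TA^TAx)^{1/2} \ge \big(\alpha\|x\|_2^2 - \tau\|x\|_1^2\big)^{1/2} \ge \big(\alpha\|x\|_2^2 - \tau(1 + k_0)^2s_0\|x_{T_0}\|_2^2\big)^{1/2}$$

$$\ge \big(\alpha - \tau(1 + k_0)^2s_0\big)^{1/2}\|x_{T_0}\|_2 \ge \sqrt{\frac{\alpha}{2}}\|x_{T_0}\|_2.$$

Thus the RE$(s_0, k_0, A)$ condition holds with

$$\frac{1}{K(s_0, k_0, A)} := \min_{x \in \mathrm{Cone}(s_0, k_0)}\frac{\|Ax\|_2}{\|x_{T_0}\|_2} \ge \sqrt{\frac{\alpha}{2}},$$

where we use the fact that for any $J \subset \{1, \ldots, p\}$ such that $|J| \le s_0$, $\|x_J\|_2 \le \|x_{T_0}\|_2$. We now show the
other direction.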

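Part II. Assume that RE$(4R^2, 2R - 1, A)$ holds for some integer $R > 1$. Assume that for some $R > 1$,

$$\|x\|_1 \le R\|x\|_2.$$

Let $(x_i^*)_{i=1}^{p}$ be the non-increasing arrangement of $(|x_i|)_{i=1}^{p}$. Then

$$\|x\|_1 \le R\Big(\sum_{i=1}^{s}(x_i^*)^2 + \sum_{i=s+1}^{\infty}\Big(\frac{\|x\|_1}{i}\Big)^2\Big)^{1/2} \le R\Big(\|x_J^*\|_2^2 + \|x\|_1^2\frac{1}{s}\Big)^{1/2},$$

where $J := \{1, \ldots, s\}$. Choose $s = 4R^2$. Then

$$\|x\|_1 \le R\Big(\|x_J^*\|_2^2 + \frac{\|x\|_1^2}{4R^2}\Big)^{1/2} \le R\|x_J^*\|_2 + \frac{1}{2}\|x\|_1.$$

Thus we have

$$\|x\|_1 \le 2R\|x_J^*\|_2 \le 2R\|x_J^*\|_1,$$ (58)

$$\text{and hence} \quad \|x_{J^c}^*\|_1 \le (2R - 1)\|x_J^*\|_1.$$ (59)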


Then $x \in \mathrm{Cone}(4R^2, 2R - 1)$. Then for all $x \in S^{p-1}$ such that $\|x\|_1 \le R\|x\|_2$, we have for $k_0 = 2R - 1$
and $s_0 := 4R^2$,

$$x^T\Gamma x \ge \frac{\|x_{T_0}\|_2^2}{K^2(s_0, k_0, A)} \ge \frac{\|x\|_2^2}{\sqrt{s_0}K^2(s_0, k_0, A)} =: \alpha\|x\|_2^2,$$

where we use the fact that $(1 + k_0)\|x_{T_0}\|_2^2 \ge \|x\|_2^2$ by Lemma 24 with $x_{T_0}$ as defined therein. Otherwise,
suppose that $\|x\|_1 \ge R\|x\|_2$. Then for a given $\tau > 0$,

$$\alpha\|x\|_2^2 - \tau\|x\|_1^2 \le \Big(\frac{1}{\sqrt{s_0}K^2(s_0, k_0, A)} - \tau R^2\Big)\|x\|_2^2.$$ (60)

Thus we have by the choice of $\tau$ as in (29) and (60),

$$x^T\Gamma x \ge \lambda_{\min}(\Gamma)\|x\|_2^2 \ge \Big(\frac{1}{\sqrt{s_0}K^2(s_0, k_0, A)} - \tau R^2\Big)\|x\|_2^2 \ge \alpha\|x\|_2^2 - \tau\|x\|_1^2.$$

The Lemma thus holds. □


9 Proof of Theorem 2 


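First we note that it is sufficient to have (14) in order for (40) to hold. (14) guarantees that for $V = 3eM_A^3/2$,

$$r(B) := \frac{\mathrm{tr}(B)}{\|B\|_2} \ge 16c'K^4\frac{f}{\log m}\log\Big(\frac{Vm\log m}{f}\Big) = 16c'K^4\frac{f}{\log m}\log\Big(\frac{3emM_A^3\log m}{2f}\Big)$$

$$= c'K^4\,4M_A^2\,\frac{4f}{M_A^2\log m}\log\Big(\frac{6emM_A}{(4/M_A^2)(f/\log m)}\Big) \ge c'K^4\frac{s_0}{\varepsilon^2}\log\Big(\frac{6emM_A}{s_0}\Big) = c'K^4\frac{s_0}{\varepsilon^2}\log\Big(\frac{3em}{s_0\varepsilon}\Big),$$ (61)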


where e = and the last inequality holds given that k\og{cm/k) on the RHS of (61) is a 

monotonically increasing function of k, and 


So < 


and Afn = A) + tb) ^ ^ 

M\\ogm Amin (A) 


Next we check that the choice of d as in (16) ensures that (41) holds. Indeed, for d < 1, we have 

d < Ca{c'K^ Al)j^< Ca {o'A 1)-^. 

logm ^ logm 


By Lemma 9, we have on event $\mathcal{A}_0$, the modified gram matrix $\hat\Gamma_A := \frac{1}{f}\big(X^TX - \widehat{\mathrm{tr}}(B)I_m\big)$ satisfies the
Lower RE conditions with

$$\text{curvature} \quad \alpha = \frac{1}{2}\lambda_{\min}(A) \quad \text{and tolerance} \quad \tau = \frac{\lambda_{\min}(A)}{2s_0} = \frac{\alpha}{s_0}.$$ (62)

Theorem 2 follows from Theorem 8, so long as we can show that condition (33) holds for $\lambda \ge 4\psi\sqrt{\frac{\log m}{f}}$,
where the parameter $\psi$ is as defined in (17), and $\alpha$ and $\tau = \frac{\alpha}{s_0}$ are as defined immediately above. Combining
(62) and (33), we need to show that (42) holds. This is precisely the content of Lemma 10. This is the end
of the proof for Theorem 2. □


10 Proof of Theorem 3


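For the set $\mathrm{Cone}_J(k_0)$ as in (57),

$$\kappa_{RE}(d_0, k_0) := \min_{J : |J| \le d_0}\ \min_{\Delta \in \mathrm{Cone}_J(k_0)}\frac{|\Delta^T\Psi\Delta|}{\|\Delta_J\|_2^2} = \Big(\frac{1}{K(d_0, k_0, (1/\sqrt{f})Z_1A^{1/2})}\Big)^2.$$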


Recall the following Theorem 25 from Rudelson and Zhou (2013).

Theorem 25. Rudelson and Zhou (2013) Set $0 < \delta < 1$, $k_0 > 0$, and $0 < d_0 < p$. Let $A^{1/2}$ be an $m \times m$
matrix satisfying the RE$(d_0, 3k_0, A^{1/2})$ condition as in Definition 2.1. Let d be as defined in (63):

$$d = d_0 + d_0\max_j\|A^{1/2}e_j\|_2^2\,\frac{16K^2(d_0, 3k_0, A^{1/2})(3k_0)^2(3k_0 + 1)}{\delta^2}.$$ (63)

Let $\Psi$ be an $n \times m$ matrix whose rows are independent isotropic $\psi_2$ random vectors in $\mathbb{R}^m$ with constant $\alpha$.
Suppose the sample size satisfies

$$n \ge \frac{2000\,d\alpha^4}{\delta^2}\log\Big(\frac{60em}{d\delta}\Big).$$ (64)

Then with probability at least $1 - 2\exp(-\delta^2n/2000\alpha^4)$, the RE$(d_0, k_0, (1/\sqrt{n})\Psi A^{1/2})$ condition holds for the
matrix $(1/\sqrt{n})\Psi A^{1/2}$ with

$$0 < K(d_0, k_0, (1/\sqrt{n})\Psi A^{1/2}) \le \frac{K(d_0, k_0, A^{1/2})}{1 - \delta}.$$ (65)

Proof of Theorem 3. Suppose RE$(2d_0, 3k_0, A^{1/2})$ holds. Then for d as defined in (28) and $f =
\Omega(dK^4\log(m/d))$, we have with probability at least $1 - 2\exp(-\delta^2f/2000K^4)$, the RE$(2d_0, k_0, \frac{1}{\sqrt{f}}Z_1A^{1/2})$
condition holds with

$$\kappa_{RE}(2d_0, k_0) = \Big(\frac{1}{K(2d_0, k_0, (1/\sqrt{f})Z_1A^{1/2})}\Big)^2 \ge \Big(\frac{1}{2K(2d_0, k_0, A^{1/2})}\Big)^2$$

by Theorem 25.

The rest of the proof follows from Belloni et al. (2014) Theorem 1 and thus we only provide a sketch. In
more detail, in view of the lemmas shown in Section 5, we need

$$\kappa_q(d_0, k_0) \ge c\,d_0^{-1/q}$$

to hold for some constant c for $\Psi := \frac{1}{f}X_0^TX_0$. It is shown in Appendix C in Belloni et al. (2014) that under
the RE$(2d_0, k_0, \frac{1}{\sqrt{f}}Z_1A^{1/2})$ condition, for any $d_0 \le m/2$ and $1 \le q \le 2$, we have

$$\kappa_1(d_0, k_0) \ge c\,d_0^{-1}\kappa_{RE}(d_0, k_0),$$

$$\kappa_q(d_0, k_0) \ge c(q)\,d_0^{-1/q}\kappa_{RE}(2d_0, k_0),$$ (66)

where $c(q) > 0$ depends on $k_0$ and q. The theorem is thus proved following exactly the same line of
arguments as in the proof of Theorem 1 in Belloni et al. (2014), in view of the $\ell_q$ sensitivity condition
derived immediately above and Lemmas 11, 12 and 13. Indeed, for $v := \hat\beta - \beta^*$, we have
by definition of $\ell_q$ sensitivity as in (37),

$$\kappa_q(d_0, k_0)\|v\|_q \le \Big\|\frac{1}{f}X_0^TX_0v\Big\|_\infty \le \mu_1\|\beta^*\|_2 + \mu_2\|v\|_1 + \tau'$$

$$\le \mu_1\|\beta^*\|_2 + \mu_2(2 + \lambda)\|v_S\|_1 + \tau'$$

$$\le \mu_1\|\beta^*\|_2 + \mu_2(2 + \lambda)d_0^{1-1/q}\|v_S\|_q + \tau'$$

$$\le \mu_1\|\beta^*\|_2 + \mu_2(2 + \lambda)d_0^{1-1/q}\|v\|_q + \tau'.$$

Thus we have for $d_0 = c_0\sqrt{f/\log m}$, where $c_0$ is sufficiently small,

$$d_0^{-1/q}\big(c(q)\kappa_{RE}(2d_0, k_0) - \mu_2(2 + \lambda)d_0\big)\|v\|_q \le \mu_1\|\beta^*\|_2 + \tau',$$

$$\text{hence} \quad \|v\|_q \le C\big(4D_2r_{m,f}K\|\beta^*\|_2 + 2D_0M_\epsilon r_{m,f}\big)d_0^{1/q} \le 4CD_2r_{m,f}\big(K\|\beta^*\|_2 + M_\epsilon\big)d_0^{1/q}$$ (67)


for some constant (7 = 1/ {c{q)Kf{E{2do, ko) — ^12(2 + X)do) > 1/ {2c{q)Kp{E{2do, ko)) given that 
ft2(2 + A)do = 2D2Kririj{~^ + 1)(2 + A)co\///logm = 2 cqCqD 2 K ‘^{2 + '^)('^ ~k 1) 

is sufficiently small and thus (21) holds. The prediction error bound follows exactly the same line of ar¬ 
guments as in Belloni et al. (2014) which we omit here. See proof of Theorem 5 in Section F for details. 

□ 


11 Proof of Theorem 4 


The proof is identical to the proof of Theorem 2 up till (62), except that we replace the condition on d as 
in the theorem statement by (25): that is. 


d 


|supp(/3*)| <Ca ^J {c^< 70A2} where (7 a := 


logm 

I^IU + Omax f K'^M: 


128M2 ^ 


D2 






> 


I dd 112 

772 


‘ B 


where $c', \phi, b_0, M_\epsilon$ and K are as defined in Theorem 2, where we assume that $b_0^2 \ge \|\beta^*\|_2^2 \ge \phi b_0^2$ for some
$0 < \phi \le 1$. Theorem 4 follows from Theorem 8, so long as we can show that condition (33) holds for
$\lambda \ge 2\psi\sqrt{\frac{\log m}{f}}$, where the parameter $\psi$ is as defined in (44), and $\alpha$ and $\tau = \frac{\alpha}{s_0}$ are as defined in (62). Combining
(62) and (33), we need to show that (42) holds. This is precisely the content of Lemma 15. This is the end
of the proof for Theorem 4. □


12 Proof of Theorem 5


Throughout this proof, we assume that Bq n Biq holds. The rest of the proof follows that of Theorem 3, 
except for the last part. Let ^i, ^ 2 ,t be as defined in Lemma 13. We have for fi 2 ■= 2/r(l + where 
/i = D'QKrmjry^, and do = cqt^ // log m. 


^Ji2{2 + \)do = 2 CoD'^K^t]/\^ + 1){2 + X)cot^ (68) 

< 2coC'oL>o-^^(2 + + 1) < -j^c{(l)>^RE{2do, ko) 

—~1 /2 

which holds when cq is sufficiently small, where by (47) < 1. Hence 

, ^ c{q)KfiE{2do,ko) 

- 2(2 +A) 

Thus for Co sufficiently small, fii = 2^, by (66), (68), (67) and (46), 


= cfo ^'^''(c(7)KRE(2do, ko) - /i 2(2 + A)do) ll^^llg 

< {Kq{do,ko) - H 2{2 + X)dl~^^‘^) ||u||^ < m ||/3*||2 + r 

< 277'r^jiT2((4/2 ^ ^y2)CoKrU%) \\/3*\\, + MJK) (69) 

and thus (31) holds, following the proof in Theorem 3. The prediction error bound follows exactly the same
line of arguments as in Belloni et al. (2014), which we now include for the sake of completeness. Following
(31), we have by (69),


1 


and hence /i 2 ||u||i 


< Ciido(Fi 11/5*112 + '^) where Cn = 2/ (c(g)KRE(2do,/ cq)) 

< CiifX2do{fJ-i\\/3*\\2 +t) 

- 2(2 + A) (c(9)«^RE(2do, A:o)) (/ri ||/5*||2 + t) 


Thus we have by (69), the bounds immediately above, and (47) 
2 


7 X{/3-/3*) 


< k; 


^X^Xov 


< CudoifJ-i 11/3*112 + r) (^1 11/3*112 + fJ -2 ll^^lli + 2 t ) 

< Cndo(/ii 11/3*112 + r)(l + (w \\n \2 + 2r) 


2 +A' 


/ 1/2 


= C'{D'^YK^do 


f 


B 


+ 


M, 


< C"(||H||2 + ara..)K^do^^^ [{2 tb + 3ClK\m,m)K^ 


3*\\l + M^ 


where $(D_0')^2 \le 2\|B\|_2 + 2a_{\max}$. The theorem is thus proved. □


13 Conclusion


In view of the main Theorems 2 and 3, at this point, we do not really think one estimator is preferable to
the other. While the rates we obtain for both estimators are at the same order for q = 1, 2, the conditions
under which these rates are obtained are somewhat different. The Lasso estimator allows large values of sparsity,
while the Conic programming estimator conceptually is more adaptive by not fixing an upper bound on $\|\beta^*\|_2$
a priori, the cost of which seems to be a more stringent requirement on the sparsity level. The lasso-type
procedure can recover a sparse model using $O(\log m)$ measurements per nonzero component
despite the measurement error in X and the stochastic noise $\epsilon$, while the Dantzig selector-type estimator allows only
$d \asymp \sqrt{f/\log m}$ to achieve the error rate at the same order as the Lasso-type estimator.

However, we show in Theorem 5 in Section 5.4 that this restriction on the sparsity can be relaxed for
the Conic programming estimator (7), when we make a different choice for the parameter $\mu$ based on a
more refined analysis. Eventually, as $\tau_B \to 0$, this relaxation on d as in (32) enables the Conic Programming
estimator to achieve bounds which are essentially identical to the Dantzig Selector when the design
matrix $X_0$ is a subgaussian random matrix satisfying the Restricted Eigenvalue conditions; see for example
Candes and Tao (2007); Bickel et al. (2009); Rudelson and Zhou (2013). For the Lasso estimator, when
we require that the stochastic error $\epsilon$ in the response variable y as in (1a) does not converge to 0 as quickly
as the measurement error W in (1b) does, then the sparsity constraint becomes essentially unchanged as
$\tau_B \to 0$. These tradeoffs are somewhat different for the Conic programming estimator
and the Lasso estimator; however, we believe the differences are minor.


We now state a slightly sharper bound than those in Eemma 14 which provides a significant improvement 
on the error bounds in case tb = o(l) while ||^||2 > 1 for the Easso-type estimator in (6) as well as the 
Conic programming estimator (7). Recall Dq := y^IFbH^ -|- Om^- By (74), 


7-r/3^ 




1/2 


J + 


2DiK 


m 


, '^m,f T 


When Tb —)• 0, we have for Dq 


! - I 1/2 

yD? + Omax 


1/2 

max 


7-r/3^ 


o 



+ DoiTMe 



logm 


where D\ = 11^112^^ under (Al), given that ||i?||^ /y/f < t^^"^ ||i?|| 2 ^^ —)• 0, and the first 

term inside the bracket comes from the estimation error in tr{B )//, which can be made go away if we were 
to assume that tr(77) is also known. In this case, the error term involving ||/3*||2 in (17) vanishes, and we 
only need to set 


where xjx - DqKM, + WAWl^^ \\/3*\\^ . (70) 

Moreover, suppose that $\mathrm{tr}(B)$ is given, then one can drop the second term in $\psi$ as in (70) and hence recover
the lasso bound when the design matrix X is assumed to be free of measurement errors.

Finally, we note that the bounds corresponding to the Upper RE condition as stated in Corollary 19,
Theorem 20 and Lemma 9 are not needed for Theorem 2. They are useful to ensure algorithmic convergence
and to bound the optimization error for the gradient descent-type of algorithms as considered
in Loh and Wainwright (2012), when one is interested in approximately solving the non-convex optimization
function (6). Our numerical results validate such algorithmic and statistical convergence properties.


A Outline

In Sections B and B.2, we present variations of the Hanson-Wright inequality as recently derived in Rudelson and Vershynin
(2013) (cf. Lemma 27), concentration of measure bounds and stochastic error bounds in Lemma 29.

In Sections C and E, we prove the technical lemmas for Theorems 2 and 3 respectively. In Section F, we
prove the lemmas needed for the proof of Theorem 5. In order to prove Corollary 19, we need to first state
some geometric analysis results in Section G. We prove Corollary 19 in Section H and Theorem 20 in Section I.

Results presented in Section 7 are proved in Section J. In particular, we prove Theorem 21 in Section J. We
also prove the concentration of measure bounds on error-corrected gram matrices in Corollaries 22 and 23
in Sections J.1 and J.2 respectively. The results appearing in Section J are proved in Section K.

B Some auxiliary results 

We first need to state the following form of the Hanson-Wright inequality as recently derived in
Rudelson and Vershynin (2013), and an auxiliary result in Lemma 27 which may be of
independent interest.

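Theorem 26. Let $X = (X_1, \ldots, X_m) \in \mathbb{R}^m$ be a random vector with independent components $X_i$ which
satisfy $\mathbb{E}(X_i) = 0$ and $\|X_i\|_{\psi_2} \le K$. Let $A$ be an $m \times m$ matrix. Then, for every $t > 0$,

$$\mathbb{P}\left( \left| X^T A X - \mathbb{E}(X^T A X) \right| > t \right) \le 2 \exp\left( -c \min\left( \frac{t^2}{K^4 \|A\|_F^2}, \frac{t}{K^2 \|A\|_2} \right) \right).$$
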

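We note that following the proof of Theorem 26, it is clear that the following holds: Let $X = (X_1, \ldots, X_m) \in \mathbb{R}^m$
be a random vector as defined in Theorem 26. Let $Y, Y'$ be independent copies of $X$. Let $A$ be an $m \times m$
matrix. Then, for every $t > 0$,

$$\mathbb{P}\left( \left| Y^T A Y' \right| > t \right) \le 2 \exp\left( -c \min\left( \frac{t^2}{K^4 \|A\|_F^2}, \frac{t}{K^2 \|A\|_2} \right) \right). \tag{71}$$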
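The concentration described by Theorem 26 can be illustrated numerically. The following is a minimal sketch, not part of the original argument: it assumes Rademacher components (which are subgaussian) and a hypothetical test matrix with entries $0.5^{|i-j|}$, and it only reports empirical deviations of $X^T A X$ from $\mathrm{tr}(A)$ rather than verifying the constants $c$ or $K$.

```python
import random


def quad_form(a, x):
    """Return x^T a x for a square matrix a (list of rows) and a vector x."""
    m = len(x)
    return sum(a[i][j] * x[i] * x[j] for i in range(m) for j in range(m))


def hanson_wright_demo(m=8, n_samples=4000, seed=0):
    """Sample X^T A X - tr(A) for Rademacher X and A_ij = 0.5 ** |i - j|.

    The matrix and sample sizes are illustrative assumptions only.
    """
    rng = random.Random(seed)
    a = [[0.5 ** abs(i - j) for j in range(m)] for i in range(m)]
    trace = sum(a[i][i] for i in range(m))  # equals E[X^T A X] since X_i^2 = 1
    devs = []
    for _ in range(n_samples):
        x = [rng.choice((-1.0, 1.0)) for _ in range(m)]
        devs.append(quad_form(a, x) - trace)
    return a, devs


a, devs = hanson_wright_demo()
frob_sq = sum(v * v for row in a for v in row)
mean_dev = sum(devs) / len(devs)
tail = sum(abs(d) > 6.0 for d in devs) / len(devs)
print(f"||A||_F^2 = {frob_sq:.3f}")
print(f"mean deviation = {mean_dev:.3f}")
print(f"empirical P(|X^T A X - tr(A)| > 6) = {tail:.4f}")
```

The empirical tail frequency can be compared against the two regimes in the bound: deviations on the scale of $\|A\|_F$ behave in a subgaussian way, while larger deviations are governed by $\|A\|_2$.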


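We next need to state Lemma 27, which we prove in Section B.1.

Lemma 27. Let $u, w \in S^{f-1}$. Let $A \succ 0$ be an $m \times m$ symmetric positive definite matrix. Let $Z$ be an
$f \times m$ random matrix with independent entries $Z_{ij}$ satisfying $\mathbb{E} Z_{ij} = 0$ and $\|Z_{ij}\|_{\psi_2} \le K$. Let $Z_1, Z_2$ be
independent copies of $Z$. Then for every $t > 0$,

$$\mathbb{P}\left( \left| u^T Z_1 A^{1/2} Z_2^T w \right| > t \right) \le 2 \exp\left( -c \min\left( \frac{t^2}{K^4 \mathrm{tr}(A)}, \frac{t}{K^2 \|A\|_2^{1/2}} \right) \right),$$

$$\mathbb{P}\left( \left| u^T Z A Z^T w - \mathbb{E} u^T Z A Z^T w \right| > t \right) \le 2 \exp\left( -c \min\left( \frac{t^2}{K^4 \|A\|_F^2}, \frac{t}{K^2 \|A\|_2} \right) \right),$$

where $c$ is the same constant as defined in Theorem 26.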


B.1 Proof of Lemma 27


Lemma 28 is a well-known fact.

Lemma 28. Let $A_{uw} := (u \otimes w) \otimes A$ where $u, w \in S^{p-1}$ and $p \ge 2$. Then $\|A_{uw}\|_2 \le \|A\|_2$ and
$\|A_{uw}\|_F \le \|A\|_F$.


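Proof of Lemma 27. Let $z_1, \ldots, z_f, z'_1, \ldots, z'_f \in \mathbb{R}^m$ be the row vectors of $Z_1, Z_2$ respectively. Notice
that we can write the quadratic forms as follows:

$$u^T Z_1 A^{1/2} Z_2^T w = \sum_{i,j=1}^{f} u_i w_j \, z_i A^{1/2} z_j'^{T} = \mathrm{vec}\{Z_1^T\}^T \left( (u \otimes w) \otimes A^{1/2} \right) \mathrm{vec}\{Z_2^T\} =: \mathrm{vec}\{Z_1^T\}^T A^{1/2}_{uw} \, \mathrm{vec}\{Z_2^T\},$$

$$u^T Z A Z^T w = \mathrm{vec}\{Z^T\}^T \left( (u \otimes w) \otimes A \right) \mathrm{vec}\{Z^T\} =: \mathrm{vec}\{Z^T\}^T A_{uw} \, \mathrm{vec}\{Z^T\},$$

where clearly by independence of $Z_1, Z_2$,

$$\mathbb{E}\, \mathrm{vec}\{Z_1^T\}^T \left( (u \otimes w) \otimes A^{1/2} \right) \mathrm{vec}\{Z_2^T\} = 0, \quad \text{and}$$

$$\mathbb{E}\, \mathrm{vec}\{Z^T\}^T \left( (u \otimes u) \otimes A \right) \mathrm{vec}\{Z^T\} = \mathrm{tr}\left( (u \otimes u) \otimes A \right) = \mathrm{tr}(A).$$

Thus we invoke (71) and Lemma 28 to show the concentration bounds on the event $\{ |u^T Z_1 A^{1/2} Z_2^T w| > t \}$:

$$\mathbb{P}\left( \left| u^T Z_1 A^{1/2} Z_2^T w \right| > t \right) \le 2 \exp\left( -c \min\left( \frac{t^2}{K^4 \|A^{1/2}_{uw}\|_F^2}, \frac{t}{K^2 \|A^{1/2}_{uw}\|_2} \right) \right) \le 2 \exp\left( -c \min\left( \frac{t^2}{K^4 \mathrm{tr}(A)}, \frac{t}{K^2 \|A^{1/2}\|_2} \right) \right).$$

Similarly, we have by Theorem 26 and Lemma 28,

$$\mathbb{P}\left( \left| u^T Z A Z^T w - \mathbb{E} u^T Z A Z^T w \right| > t \right) \le 2 \exp\left( -c \min\left( \frac{t^2}{K^4 \|A_{uw}\|_F^2}, \frac{t}{K^2 \|A_{uw}\|_2} \right) \right) \le 2 \exp\left( -c \min\left( \frac{t^2}{K^4 \|A\|_F^2}, \frac{t}{K^2 \|A\|_2} \right) \right).$$

The Lemma thus holds. □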
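The vectorization step in the proof of Lemma 27 can be checked numerically. The sketch below is an illustration under assumptions: it reads $u \otimes w$ as the outer product $u w^T$, uses a generic matrix `mat` in place of $A^{1/2}$, and the helper names are hypothetical. It compares the bilinear form $u^T Z_1 M Z_2^T w$ computed directly with the block-wise sum that defines $\mathrm{vec}\{Z_1^T\}^T ((u \otimes w) \otimes M)\, \mathrm{vec}\{Z_2^T\}$.

```python
import random


def matmul(p, q):
    """Multiply two matrices given as lists of rows."""
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]


def transpose(p):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*p)]


def bilinear_direct(u, z1, mat, z2, w):
    """Compute u^T Z1 M Z2^T w by explicit matrix products."""
    inner = matmul(matmul(z1, mat), transpose(z2))  # f x f matrix
    f = len(u)
    return sum(u[i] * inner[i][j] * w[j] for i in range(f) for j in range(f))


def bilinear_vectorized(u, z1, mat, z2, w):
    """Compute vec(Z1^T)^T ((u w^T) kron M) vec(Z2^T) block by block.

    vec(Z^T) stacks the rows of Z, so block (i, j) of the Kronecker
    product, namely u_i * w_j * M, pairs row i of Z1 with row j of Z2.
    """
    f, m = len(z1), len(z1[0])
    total = 0.0
    for i in range(f):
        for j in range(f):
            coeff = u[i] * w[j]
            for k in range(m):
                for l in range(m):
                    total += z1[i][k] * coeff * mat[k][l] * z2[j][l]
    return total


rng = random.Random(1)
f, m = 3, 4
u = [rng.gauss(0.0, 1.0) for _ in range(f)]
w = [rng.gauss(0.0, 1.0) for _ in range(f)]
z1 = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(f)]
z2 = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(f)]
mat = [[0.5 ** abs(k - l) for l in range(m)] for k in range(m)]

direct = bilinear_direct(u, z1, mat, z2, w)
vectorized = bilinear_vectorized(u, z1, mat, z2, w)
print(abs(direct - vectorized) < 1e-9)
```

Writing the form this way is what allows the decoupled inequality (71) to be applied to the long vectors $\mathrm{vec}\{Z_1^T\}$ and $\mathrm{vec}\{Z_2^T\}$, which have independent entries.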


B.2 Stochastic error terms 

The following large deviation bounds in Lemmas 29 and 7 are the key results in proving Lemmas 6 and 13.
Let $C_0$ satisfy (86) for $c$ as defined in Theorem 26. Throughout this section, we denote by:

$$r_{m,f} = C_0 K \sqrt{\frac{\log m}{f}} \quad \text{and} \quad r_{m,m} = 2 C_0 \sqrt{\frac{\log m}{m f}}.$$

We also define some events $\mathcal{B}_4, \mathcal{B}_5, \mathcal{B}_{10}$. Denote by $\mathcal{B}_0 := \mathcal{B}_4 \cap \mathcal{B}_5 \cap \mathcal{B}_6$, which we use throughout this
paper.

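Lemma 29. Assume that the stable rank of $B$ satisfies $\|B\|_F^2 / \|B\|_2^2 \ge \log m$. Let $Z$, $X_0$ and $W$ be as defined in
Theorem 2. Let $Z_0$, $Z_1$ and $Z_2$ be independent copies of $Z$. Let $\epsilon^T \sim Y M_\epsilon / K$ where $Y := e_1^T Z_0^T$. Let
$\tau_B = \mathrm{tr}(B)/f$. Denote by $\mathcal{B}_4$ the event such that

$$\frac{1}{f} \left\| A^{1/2} Z_1^T \epsilon \right\|_\infty \le r_{m,f} M_\epsilon a_{\max}^{1/2} \quad \text{and} \quad \frac{1}{f} \left\| Z_2^T B^{1/2} \epsilon \right\|_\infty \le r_{m,f} M_\epsilon \sqrt{\tau_B}.$$

Then $\mathbb{P}(\mathcal{B}_4) \ge 1 - 4/m^3$. Moreover, denote by $\mathcal{B}_5$ the event such that

$$\frac{1}{f} \left\| (Z^T B Z - \mathrm{tr}(B) I_m) \beta^* \right\|_\infty \le r_{m,f} K \|\beta^*\|_2 \frac{\|B\|_F}{\sqrt{f}} \quad \text{and} \quad \frac{1}{f} \left\| X_0^T W \beta^* \right\|_\infty \le r_{m,f} K \|\beta^*\|_2 \sqrt{\tau_B}\, a_{\max}^{1/2}.$$

Then $\mathbb{P}(\mathcal{B}_5) \ge 1 - 4/m^3$.

Finally, denote by $\mathcal{B}_{10}$ the event such that

$$\frac{1}{f} \left\| Z^T B Z - \mathrm{tr}(B) I_m \right\|_{\max} \le r_{m,f} K \frac{\|B\|_F}{\sqrt{f}} \quad \text{and} \quad \frac{1}{f} \left\| X_0^T W \right\|_{\max} \le r_{m,f} K \sqrt{\tau_B}\, a_{\max}^{1/2}.$$

Then $\mathbb{P}(\mathcal{B}_{10}) \ge 1 - 4/m^2$.

We prove Lemma 29 in Section B.3.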
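The stable rank condition in Lemma 29 is easy to evaluate when the spectrum of $B$ is available. The following sketch is illustrative only: it assumes a hypothetical spectrum for a positive semidefinite $B$ and computes $r(B) = \mathrm{tr}(B)/\|B\|_2$ and the stable rank $\|B\|_F^2/\|B\|_2^2$ from the eigenvalues, then compares the latter with $\log m$.

```python
import math


def stable_rank_summary(eigs, m):
    """Summarize rank-type quantities of a PSD f x f matrix B.

    eigs: eigenvalues of B (nonnegative); m: number of covariates.
    """
    f = len(eigs)
    op_norm = max(eigs)                      # ||B||_2
    trace = sum(eigs)                        # tr(B)
    frob_sq = sum(v * v for v in eigs)       # ||B||_F^2
    return {
        "f": f,
        "effective_rank": trace / op_norm,   # r(B) = tr(B) / ||B||_2
        "stable_rank": frob_sq / op_norm ** 2,
        "log_m": math.log(m),
    }


# Hypothetical slowly decaying spectrum, chosen only for illustration.
eigs_b = [1.0 / (1.0 + 0.1 * i) for i in range(50)]
summary = stable_rank_summary(eigs_b, m=1000)
for key, value in summary.items():
    print(f"{key}: {value:.3f}")
print("condition holds:", summary["stable_rank"] >= summary["log_m"])
```

For any positive semidefinite $B$ the chain $f \ge \mathrm{tr}(B)/\|B\|_2 \ge \|B\|_F^2/\|B\|_2^2$ holds, so only the last comparison with $\log m$ is a genuine assumption on $B$.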

B.3 Stochastic error bounds 


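Following Lemma 27, we have for all $t > 0$, $B \succ 0$ being an $f \times f$ symmetric positive definite matrix,
and $v, w \in S^{m-1}$,

$$\mathbb{P}\left( \left| v^T Z_1^T B^{1/2} Z_2 w \right| > t \right) \le 2 \exp\left( -c \min\left( \frac{t^2}{K^4 \mathrm{tr}(B)}, \frac{t}{K^2 \|B\|_2^{1/2}} \right) \right),$$

$$\mathbb{P}\left( \left| v^T Z^T B Z w - \mathbb{E} v^T Z^T B Z w \right| > t \right) \le 2 \exp\left( -c \min\left( \frac{t^2}{K^4 \|B\|_F^2}, \frac{t}{K^2 \|B\|_2} \right) \right). \tag{72}$$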
B.4 Proof for Lemma 29


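Let $e_1, \ldots, e_m \in \mathbb{R}^m$ be the canonical basis spanning $\mathbb{R}^m$. Let $x_1, \ldots, x_m, x'_1, \ldots, x'_m \in \mathbb{R}^f$ be the
column vectors of $Z_1, Z_2$ respectively. Let $Y \sim e_1^T Z_0^T$. Let $w_i = A^{1/2} e_i / \|A^{1/2} e_i\|_2$ for all $i$. Clearly the condition on
the stable rank of $B$ guarantees that

$$f \ge r(B) := \frac{\mathrm{tr}(B)}{\|B\|_2} \ge \frac{\|B\|_F^2}{\|B\|_2^2} \ge \log m.$$

By (71), we obtain for $t' = C_0 M_\epsilon K \sqrt{\mathrm{tr}(B) \log m}$ and $t = C_0 K^2 \sqrt{\log m}\, \mathrm{tr}(B)^{1/2}$:

$$\mathbb{P}\left( \exists j, \left| \epsilon^T B^{1/2} Z_2 e_j \right| > t' \right) = \mathbb{P}\left( \exists j, \frac{M_\epsilon}{K} \left| e_1^T Z_0^T B^{1/2} Z_2 e_j \right| > C_0 M_\epsilon K \sqrt{\log m}\, \mathrm{tr}(B)^{1/2} \right)$$

$$\le \exp(\log m)\, \mathbb{P}\left( \left| Y^T B^{1/2} x'_j \right| > C_0 K^2 \sqrt{\log m}\, \mathrm{tr}(B)^{1/2} \right) \le 2/m^3,$$

where the last inequality holds by the union bound, given that $\mathrm{tr}(B)/\|B\|_2 \ge \log m$, and for all $j$

$$\mathbb{P}\left( \left| Y^T B^{1/2} x'_j \right| > t \right) \le 2 \exp\left( -c \min\left( \frac{t^2}{K^4 \mathrm{tr}(B)}, \frac{t}{K^2 \|B\|_2^{1/2}} \right) \right)$$

$$\le 2 \exp\left( -c \min\left( C_0^2 \log m, \frac{C_0 \log^{1/2} m \sqrt{\mathrm{tr}(B)}}{\|B\|_2^{1/2}} \right) \right) \le 2 \exp\left( -c \min(C_0^2, C_0) \log m \right) \le 2 \exp(-4 \log m).$$

Let $v, w \in S^{m-1}$. Thus we have by Lemma 27, for $t_0 = C_0 M_\epsilon K \sqrt{f \log m}$ and $\tau = C_0 K^2 \sqrt{f \log m}$,
$w_j = A^{1/2} e_j / \|A^{1/2} e_j\|_2$ and $f \ge \log m$,

$$\mathbb{P}\left( \exists j, \left| \epsilon^T Z_1 w_j \right| > t_0 \right) \le \mathbb{P}\left( \exists j, \frac{M_\epsilon}{K} \left| Y^T Z_1 w_j \right| > C_0 M_\epsilon K \sqrt{f \log m} \right) \le m\, \mathbb{P}\left( \left| Y^T Z_1 w_j \right| > C_0 K^2 \sqrt{f \log m} \right)$$

$$= \exp(\log m)\, \mathbb{P}\left( \left| e_1^T Z_0^T Z_1 w_j \right| > \tau \right) \le 2 \exp\left( -c \min\left( \frac{\tau^2}{f K^4}, \frac{\tau}{K^2} \right) + \log m \right)$$

$$\le 2 m \exp\left( -c \min\left( \frac{(C_0 K^2 \sqrt{f \log m})^2}{f K^4}, \frac{C_0 K^2 \sqrt{f \log m}}{K^2} \right) \right) \le 2 m \exp\left( -c \min\left( C_0^2 \log m, C_0 \log^{1/2} m \sqrt{f} \right) \right)$$

$$\le 2 m \exp\left( -c \min(C_0^2, C_0) \log m \right) \le 2 \exp(-3 \log m).$$

Therefore we have with probability at least $1 - 4/m^3$,

$$\left\| Z_2^T B^{1/2} \epsilon \right\|_\infty := \max_{j=1,\ldots,m} \left| \langle \epsilon^T B^{1/2} Z_2, e_j \rangle \right| \le t' = C_0 M_\epsilon K \sqrt{\mathrm{tr}(B) \log m},$$

$$\left\| A^{1/2} Z_1^T \epsilon \right\|_\infty := \max_{j=1,\ldots,m} \left| \langle A^{1/2} e_j, Z_1^T \epsilon \rangle \right| \le \max_{j} \left\| A^{1/2} e_j \right\|_2 \max_{j} \left| \langle w_j, Z_1^T \epsilon \rangle \right| \le a_{\max}^{1/2} t_0 = a_{\max}^{1/2} C_0 M_\epsilon K \sqrt{f \log m}.$$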
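The "moreover" part follows exactly the same arguments as above. Denote by $\bar\beta^* := \beta^* / \|\beta^*\|_2 \in S^{m-1}$
and $w_i := A^{1/2} e_i / \|A^{1/2} e_i\|_2$. By (72),

$$\mathbb{P}\left( \exists i, \langle w_i, Z_1^T B^{1/2} Z_2 \bar\beta^* \rangle > C_0 K^2 \sqrt{\log m}\, \mathrm{tr}(B)^{1/2} \right) \le \sum_{i=1}^{m} \mathbb{P}\left( \langle w_i, Z_1^T B^{1/2} Z_2 \bar\beta^* \rangle > C_0 K^2 \sqrt{\log m\, \mathrm{tr}(B)} \right)$$

$$\le 2 \exp\left( -c \min\left( C_0^2 \log m, C_0 \log m \right) + \log m \right) \le 2/m^3.$$

Now for $t = C_0 K^2 \sqrt{\log m}\, \|B\|_F$, and $\|B\|_F / \|B\|_2 \ge \sqrt{\log m}$,

$$\mathbb{P}\left( \exists e_i : \langle e_i, (Z^T B Z - \mathrm{tr}(B) I_m) \bar\beta^* \rangle > C_0 K^2 \sqrt{\log m}\, \|B\|_F \right) \le 2 m \exp\left( -c \min\left( \frac{t^2}{K^4 \|B\|_F^2}, \frac{t}{K^2 \|B\|_2} \right) \right) \le 2/m^3.$$

By the two inequalities immediately above, we have with probability at least $1 - 4/m^3$,

$$\left\| X_0^T W \beta^* \right\|_\infty = \left\| A^{1/2} Z_1^T B^{1/2} Z_2 \beta^* \right\|_\infty \le \|\beta^*\|_2 \max_i \left\| A^{1/2} e_i \right\|_2 \sup_{w_i} \langle w_i, Z_1^T B^{1/2} Z_2 \bar\beta^* \rangle \le C_0 K^2 \|\beta^*\|_2 \sqrt{\log m}\, a_{\max}^{1/2} \sqrt{\mathrm{tr}(B)}$$

and

$$\left\| (Z^T B Z - \mathrm{tr}(B) I_m) \beta^* \right\|_\infty = \left\| (Z^T B Z - \mathrm{tr}(B) I_m) \bar\beta^* \right\|_\infty \|\beta^*\|_2 = \|\beta^*\|_2 \sup_{e_i} \langle e_i, (Z^T B Z - \mathrm{tr}(B) I_m) \bar\beta^* \rangle \le C_0 K^2 \|\beta^*\|_2 \sqrt{\log m}\, \|B\|_F.$$

The last two bounds follow exactly the same arguments as above, except that we replace $\bar\beta^*$ with $e_j$, $j = 1, \ldots, m$,
and apply the union bound to $m^2$ events instead of $m$, and thus $\mathbb{P}(\mathcal{B}_{10}) \ge 1 - 4/m^2$. □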

C Proofs for the Lasso-type estimator 


C.1 Proof of Lemma 6


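Clearly the condition on the stable rank of $B$ guarantees that

$$f \ge r(B) := \frac{\mathrm{tr}(B)}{\|B\|_2} \ge \frac{\|B\|_F^2}{\|B\|_2^2} \ge \log m.$$

Thus the conditions in Lemmas 29 and 7 hold. First notice that

$$\hat\gamma = \frac{1}{f} \left( X_0^T X_0 \beta^* + W^T X_0 \beta^* + X_0^T \epsilon + W^T \epsilon \right),$$

$$\hat\Gamma \beta^* = \frac{1}{f} \left( X^T X - \widehat{\mathrm{tr}}(B) I_m \right) \beta^* = \frac{1}{f} \left( X_0^T X_0 + W^T X_0 + X_0^T W + W^T W - \widehat{\mathrm{tr}}(B) I_m \right) \beta^*.$$

Thus

$$\left\| \hat\gamma - \hat\Gamma \beta^* \right\|_\infty = \frac{1}{f} \left\| X_0^T \epsilon + W^T \epsilon - \left( W^T W + X_0^T W - \widehat{\mathrm{tr}}(B) I_m \right) \beta^* \right\|_\infty$$

$$\le \frac{1}{f} \left\| X_0^T \epsilon + W^T \epsilon \right\|_\infty + \frac{1}{f} \left\| (W^T W - \widehat{\mathrm{tr}}(B) I_m) \beta^* \right\|_\infty + \frac{1}{f} \left\| X_0^T W \beta^* \right\|_\infty$$

$$\le \frac{1}{f} \left\| X_0^T \epsilon + W^T \epsilon \right\|_\infty + \frac{1}{f} \left\| (Z^T B Z - \mathrm{tr}(B) I_m) \beta^* \right\|_\infty + \frac{1}{f} \left\| X_0^T W \beta^* \right\|_\infty + \frac{1}{f} \left| \widehat{\mathrm{tr}}(B) - \mathrm{tr}(B) \right| \|\beta^*\|_\infty =: U_1 + U_2 + U_3 + U_4.$$

By Lemma 29 we have on $\mathcal{B}_4$, for $D_0 := \sqrt{\tau_B} + a_{\max}^{1/2}$,

$$U_1 = \frac{1}{f} \left\| X_0^T \epsilon + W^T \epsilon \right\|_\infty = \frac{1}{f} \left\| A^{1/2} Z_1^T \epsilon + Z_2^T B^{1/2} \epsilon \right\|_\infty \le r_{m,f} M_\epsilon D_0,$$

and on event $\mathcal{B}_5$, for $D_0' := \|B\|_2^{1/2} + a_{\max}^{1/2}$,

$$U_2 + U_3 = \frac{1}{f} \left\| (Z^T B Z - \mathrm{tr}(B) I_m) \beta^* \right\|_\infty + \frac{1}{f} \left\| X_0^T W \beta^* \right\|_\infty \le r_{m,f} K \|\beta^*\|_2 \left( \frac{\|B\|_F}{\sqrt{f}} + \sqrt{\tau_B}\, a_{\max}^{1/2} \right) \le K r_{m,f} \|\beta^*\|_2 \tau_B^{1/2} D_0',$$

where recall $\|B\|_F \le \sqrt{\mathrm{tr}(B)}\, \|B\|_2^{1/2}$. Denote by $\mathcal{B}_0 := \mathcal{B}_4 \cap \mathcal{B}_5 \cap \mathcal{B}_6$. We have on $\mathcal{B}_0$ and under (A1), by
Lemmas 29 and 7 and $D_1$ defined therein,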


7-r/3^ 


< UI + U2 + U3 + U4 

< VmjM^Do + D'^r^J^Kr^j H/JH^ + j\tr{B) - tr(i?)| ||/31|, 

< DoM.r^j + II/3II2 ' 


< DoM^VmJ + 


2 ' m 


.,/ + ‘^BiK 




(73) 

(74) 




-D 2 + D2 —j= 
4 \/m 


K\\p*\L + DoXI, 


where 2Di < 2 HAH + 2 H-BH = D 2 , for {D'^f < 2 H-BH + 2a 


max 


Dq < Dq < Y^2(||i3||2 + Omax) < 2(amax + — 392, 

and < {\\B\\\B + <tb + ]^{\\B\\^ + Omax) < 

given that under (A1): $\tau_A = 1$, $\|A\|_2 \ge a_{\max} \ge a_{\max}^{1/2} \ge 1$. Hence the lemma holds for $m \ge 16$ and

$\psi = C_0 D_2 K (K \|\beta^*\|_2 + M_\epsilon)$. Finally, we have by the union bound, $\mathbb{P}(\mathcal{B}_0) \ge 1 - 16/m^3$. □
C.2 Proof of Lemma 7


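First we write

$$X X^T - \mathrm{tr}(A) I_f = (Z_1 A^{1/2} + B^{1/2} Z_2)(Z_1 A^{1/2} + B^{1/2} Z_2)^T - \mathrm{tr}(A) I_f$$

$$= Z_1 A^{1/2} Z_2^T B^{1/2} + B^{1/2} Z_2 Z_2^T B^{1/2} + B^{1/2} Z_2 A^{1/2} Z_1^T + Z_1 A Z_1^T - \mathrm{tr}(A) I_f.$$

Thus we have for $\widetilde{\mathrm{tr}}(B) := \frac{1}{m} \left( \|X\|_F^2 - f\, \mathrm{tr}(A) \right)$,

$$\frac{1}{f} \left( \widetilde{\mathrm{tr}}(B) - \mathrm{tr}(B) \right) := \frac{1}{m f} \left( \|X\|_F^2 - f\, \mathrm{tr}(A) - m\, \mathrm{tr}(B) \right) = \frac{1}{m f} \left( \mathrm{tr}(X X^T) - f\, \mathrm{tr}(A) - m\, \mathrm{tr}(B) \right)$$

$$= \frac{2}{m f} \mathrm{tr}(Z_1 A^{1/2} Z_2^T B^{1/2}) + \left( \frac{\mathrm{tr}(B^{1/2} Z_2 Z_2^T B^{1/2})}{m f} - \frac{\mathrm{tr}(B)}{f} \right) + \left( \frac{\mathrm{tr}(Z_1 A Z_1^T)}{m f} - \frac{\mathrm{tr}(A)}{m} \right).$$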

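By constructing a new matrix $A_f = I_f \otimes A$, which is block diagonal with $f$ identical submatrices $A$ along
its diagonal, we prove the following large deviation bound: for $t_1 = C_0 K^2 \|A\|_F \sqrt{f \log m}$ and $f \ge \log m$,

$$\mathbb{P}\left( \left| \mathrm{tr}(Z_1 A Z_1^T) - f\, \mathrm{tr}(A) \right| > t_1 \right) = \mathbb{P}\left( \left| \mathrm{vec}\{Z_1^T\}^T (I_f \otimes A)\, \mathrm{vec}\{Z_1^T\} - f\, \mathrm{tr}(A) \right| > t_1 \right)$$

$$\le 2 \exp\left( -c \min\left( \frac{t_1^2}{K^4 \|A_f\|_F^2}, \frac{t_1}{K^2 \|A_f\|_2} \right) \right)$$

$$\le 2 \exp\left( -c \min\left( \frac{(C_0 K^2 \sqrt{f \log m}\, \|A\|_F)^2}{K^4 f \|A\|_F^2}, \frac{C_0 K^2 \sqrt{f \log m}\, \|A\|_F}{K^2 \|A\|_2} \right) \right) \le 2 \exp(-4 \log m),$$

where the first inequality holds by Theorem 26 and the second inequality holds given that $\|A_f\|_F^2 = f \|A\|_F^2$
and $\|A_f\|_2 = \|A\|_2$. Similarly, by constructing a new matrix $B_m = I_m \otimes B$, which is block diagonal
with $m$ identical submatrices $B$ along its diagonal, we prove the following large deviation bound: for
$t_2 = C_0 K^2 \|B\|_F \sqrt{m \log m}$ and $m \ge 2$,

$$\mathbb{P}\left( \left| \mathrm{tr}(Z_2^T B Z_2) - m\, \mathrm{tr}(B) \right| > t_2 \right) = \mathbb{P}\left( \left| \mathrm{vec}\{Z_2\}^T (I_m \otimes B)\, \mathrm{vec}\{Z_2\} - m\, \mathrm{tr}(B) \right| > t_2 \right)$$

$$\le 2 \exp\left( -c \min\left( \frac{t_2^2}{K^4 m \|B\|_F^2}, \frac{t_2}{K^2 \|B\|_2} \right) \right)$$

$$\le 2 \exp\left( -c \min\left( \frac{(C_0 K^2 \sqrt{m \log m}\, \|B\|_F)^2}{K^4 m \|B\|_F^2}, \frac{C_0 K^2 \sqrt{m \log m}\, \|B\|_F}{K^2 \|B\|_2} \right) \right) \le 2 \exp(-4 \log m).$$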
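Finally, we have by (71) for $t_0 = C_0 K^2 \sqrt{\mathrm{tr}(A)\, \mathrm{tr}(B) \log m}$,

$$\mathbb{P}\left( \left| \mathrm{vec}\{Z_1\}^T (B^{1/2} \otimes A^{1/2})\, \mathrm{vec}\{Z_2\} \right| > t_0 \right) \le 2 \exp\left( -c \min\left( \frac{t_0^2}{K^4 \|B^{1/2} \otimes A^{1/2}\|_F^2}, \frac{t_0}{K^2 \|B^{1/2} \otimes A^{1/2}\|_2} \right) \right)$$

$$= 2 \exp\left( -c \min\left( \frac{(C_0 \sqrt{\mathrm{tr}(A)\, \mathrm{tr}(B) \log m})^2}{\mathrm{tr}(A)\, \mathrm{tr}(B)}, \frac{C_0 \sqrt{\mathrm{tr}(A)\, \mathrm{tr}(B) \log m}}{\|B\|_2^{1/2} \|A\|_2^{1/2}} \right) \right) \le 2 \exp(-4 \log m),$$

where we used the fact that $r(A) r(B) \ge \log m$, $\|B^{1/2} \otimes A^{1/2}\|_2 = \|B\|_2^{1/2} \|A\|_2^{1/2}$, and

$$\left\| B^{1/2} \otimes A^{1/2} \right\|_F^2 = \mathrm{tr}\left( (B^{1/2} \otimes A^{1/2})(B^{1/2} \otimes A^{1/2}) \right) = \mathrm{tr}(B \otimes A) = \mathrm{tr}(A)\, \mathrm{tr}(B).$$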


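Thus we have with probability at least $1 - 6/m^4$,

$$\frac{1}{f} \left| \widetilde{\mathrm{tr}}(B) - \mathrm{tr}(B) \right| = \frac{1}{m f} \left| \mathrm{tr}(X X^T) - f\, \mathrm{tr}(A) - m\, \mathrm{tr}(B) \right|$$

$$\le \frac{2}{m f} \left| \mathrm{vec}\{Z_1\}^T (B^{1/2} \otimes A^{1/2})\, \mathrm{vec}\{Z_2\} \right| + \left| \frac{\mathrm{tr}(Z_2^T B Z_2)}{m f} - \frac{\mathrm{tr}(B)}{f} \right| + \left| \frac{\mathrm{tr}(Z_1 A Z_1^T)}{m f} - \frac{\mathrm{tr}(A)}{m} \right|$$

$$\le \frac{1}{m f} (2 t_0 + t_1 + t_2) = \frac{\sqrt{\log m}}{\sqrt{m f}} C_0 K^2 \left( \frac{\|A\|_F}{\sqrt{m}} + 2 \sqrt{\tau_A \tau_B} + \frac{\|B\|_F}{\sqrt{f}} \right) \le 2 C_0 \frac{\sqrt{\log m}}{\sqrt{m f}} K^2 D_1 =: D_1 K^2 r_{m,m},$$

where recall $r_{m,m} = 2 C_0 \sqrt{\frac{\log m}{m f}}$, $D_1 = \frac{\|A\|_F}{\sqrt{m}} + \frac{\|B\|_F}{\sqrt{f}}$, and

$$2 \sqrt{\tau_A \tau_B} \le \tau_A + \tau_B \le \frac{\|A\|_F}{\sqrt{m}} + \frac{\|B\|_F}{\sqrt{f}}.$$

To see this, recall

$$m \tau_A = \sum_{i=1}^{m} \lambda_i(A) \le \sqrt{m} \left( \sum_{i=1}^{m} \lambda_i^2(A) \right)^{1/2} = \sqrt{m}\, \|A\|_F, \tag{75}$$

$$f \tau_B = \sum_{i=1}^{f} \lambda_i(B) \le \sqrt{f} \left( \sum_{i=1}^{f} \lambda_i^2(B) \right)^{1/2} = \sqrt{f}\, \|B\|_F,$$

where $\lambda_i(A)$, $i = 1, \ldots, m$, and $\lambda_i(B)$, $i = 1, \ldots, f$, denote the eigenvalues of the positive semidefinite
covariance matrices $A$ and $B$ respectively.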
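The inequalities around (75) can be checked on concrete spectra. The sketch below is only an illustration with hypothetical eigenvalues: it computes $\tau_A$, $\tau_B$ and $D_1 = \|A\|_F/\sqrt{m} + \|B\|_F/\sqrt{f}$ from the eigenvalues and confirms the chain $2\sqrt{\tau_A \tau_B} \le \tau_A + \tau_B \le D_1$, which follows from the arithmetic-geometric mean inequality and Cauchy-Schwarz.

```python
import math


def d1_constant(eigs_a, eigs_b):
    """Return (tau_A, tau_B, D_1) from the eigenvalues of PSD matrices A and B."""
    m, f = len(eigs_a), len(eigs_b)
    tau_a = sum(eigs_a) / m                              # tr(A) / m
    tau_b = sum(eigs_b) / f                              # tr(B) / f
    frob_a = math.sqrt(sum(v * v for v in eigs_a))       # ||A||_F
    frob_b = math.sqrt(sum(v * v for v in eigs_b))       # ||B||_F
    d1 = frob_a / math.sqrt(m) + frob_b / math.sqrt(f)
    return tau_a, tau_b, d1


# Hypothetical spectra, for illustration only.
eigs_a = [2.0, 1.5, 1.0, 0.5, 0.25, 0.75]
eigs_b = [0.9, 0.3, 0.1, 0.05]
tau_a, tau_b, d1 = d1_constant(eigs_a, eigs_b)
print(f"2 sqrt(tau_A tau_B) = {2 * math.sqrt(tau_a * tau_b):.4f}")
print(f"tau_A + tau_B       = {tau_a + tau_b:.4f}")
print(f"D_1                 = {d1:.4f}")
```

Equality in the second step would require both spectra to be flat, so $D_1$ exceeds $\tau_A + \tau_B$ whenever the eigenvalues are spread out.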

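Denote by $\mathcal{B}_6$ the following event

$$\left\{ \frac{1}{f} \left| \widetilde{\mathrm{tr}}(B) - \mathrm{tr}(B) \right| \le D_1 K^2 r_{m,m} \right\}.$$

Clearly $\widehat{\mathrm{tr}}(B) := (\widetilde{\mathrm{tr}}(B))_+$ by definition (4). As a consequence, on $\mathcal{B}_6$, $\widehat{\mathrm{tr}}(B) = \widetilde{\mathrm{tr}}(B) > 0$ when $\tau_B >
D_1 K^2 r_{m,m}$; hence

$$\frac{1}{f} \left| \widehat{\mathrm{tr}}(B) - \mathrm{tr}(B) \right| = \frac{1}{f} \left| \widetilde{\mathrm{tr}}(B) - \mathrm{tr}(B) \right| \le D_1 K^2 r_{m,m}.$$

Otherwise, it is possible that $\widetilde{\mathrm{tr}}(B) < 0$. However, suppose we set

$$\hat\tau_B := \frac{1}{f} \widehat{\mathrm{tr}}(B) := \frac{1}{f} \left( \widetilde{\mathrm{tr}}(B) \vee 0 \right),$$

then we can also guarantee that

$$|\hat\tau_B - \tau_B| = |\tau_B| \le D_1 K^2 r_{m,m} \quad \text{in case } \tau_B \le D_1 K^2 r_{m,m}.$$

The lemma is thus proved. □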
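The truncated trace estimator analyzed in this proof is simple to compute from the data. The following is a minimal simulation sketch under strong simplifying assumptions that are not in the original text: $A = I_m$, a diagonal $B$, and standard Gaussian entries for $Z_1, Z_2$. The helper names are hypothetical. It evaluates $\frac{1}{m}(\|X\|_F^2 - f\,\mathrm{tr}(A))_+$ and compares it with the true $\mathrm{tr}(B)$.

```python
import math
import random


def trace_b_estimate(x, trace_a):
    """Truncated estimator (1/m) * max(||X||_F^2 - f * tr(A), 0)."""
    f, m = len(x), len(x[0])
    frob_sq = sum(v * v for row in x for v in row)
    return max(frob_sq - f * trace_a, 0.0) / m


def simulate_design(f, m, b_diag, seed=0):
    """Draw X = Z1 + B^{1/2} Z2 with A = I_m and B = diag(b_diag).

    Z1, Z2 have i.i.d. standard Gaussian entries (an assumption made
    here for illustration); row i of the noise is scaled by sqrt(b_i).
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) + math.sqrt(b_diag[i]) * rng.gauss(0.0, 1.0)
             for _ in range(m)] for i in range(f)]


f, m = 40, 60
b_diag = [0.5] * f                 # tr(B) = 20
x = simulate_design(f, m, b_diag)
est = trace_b_estimate(x, trace_a=float(m))
print(f"true tr(B) = {sum(b_diag):.2f}, estimate = {est:.2f}")
```

The truncation at zero matters only when $\tau_B$ is below the noise level $D_1 K^2 r_{m,m}$, which is the second case treated at the end of the proof.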


D Proof of Theorem 8 

Denote by $\beta = \beta^*$. Let $S := \mathrm{supp}\, \beta$, $d = |S|$ and

$$v = \hat\beta - \beta,$$

where $\hat\beta$ is as defined in (6). We first show Lemma 30, followed by the proof of Theorem 8.

Lemma 30. (Bickel et al. (2009); Loh and Wainwright (2012)) Suppose that (34) holds. Suppose that there
exists a parameter $\psi$ such that

$$\sqrt{d}\, \tau \le \frac{\psi}{b_0} \sqrt{\frac{\log m}{f}}, \quad \text{and} \quad \lambda \ge 4 \psi \sqrt{\frac{\log m}{f}},$$

where $b_0, \lambda$ are as defined in (6). Then $\|v_{S^c}\|_1 \le 3 \|v_S\|_1$.
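Proof. By the optimality of $\hat\beta$, we have

$$\lambda \|\beta\|_1 - \lambda \|\hat\beta\|_1 \ge \frac{1}{2} \hat\beta^T \hat\Gamma \hat\beta - \frac{1}{2} \beta^T \hat\Gamma \beta - \langle \hat\gamma, v \rangle = \frac{1}{2} v^T \hat\Gamma v + \langle v, \hat\Gamma \beta \rangle - \langle v, \hat\gamma \rangle = \frac{1}{2} v^T \hat\Gamma v - \langle v, \hat\gamma - \hat\Gamma \beta \rangle.$$

Hence, we have for $\lambda \ge 4 \psi \sqrt{\frac{\log m}{f}}$,

$$\frac{1}{2} v^T \hat\Gamma v \le \langle v, \hat\gamma - \hat\Gamma \beta \rangle + \lambda \left( \|\beta\|_1 - \|\hat\beta\|_1 \right) \le \lambda \left( \|\beta\|_1 - \|\hat\beta\|_1 \right) + \left\| \hat\gamma - \hat\Gamma \beta \right\|_\infty \|v\|_1. \tag{76}$$

Hence

$$v^T \hat\Gamma v \le \lambda \left( 2 \|\beta\|_1 - 2 \|\hat\beta\|_1 \right) + 2 \psi \sqrt{\frac{\log m}{f}}\, \|v\|_1 \le \lambda \left( 2 \|\beta\|_1 - 2 \|\hat\beta\|_1 + \frac{1}{2} \|v\|_1 \right) \tag{77}$$

$$\le \lambda \frac{1}{2} \left( 5 \|v_S\|_1 - 3 \|v_{S^c}\|_1 \right), \tag{78}$$

where by the triangle inequality, and $\beta_{S^c} = 0$, we have

$$2 \|\beta\|_1 - 2 \|\hat\beta\|_1 + \frac{1}{2} \|v\|_1 = 2 \|\beta_S\|_1 - 2 \|\hat\beta_S\|_1 - 2 \|v_{S^c}\|_1 + \frac{1}{2} \|v_S\|_1 + \frac{1}{2} \|v_{S^c}\|_1$$

$$\le 2 \|v_S\|_1 - 2 \|v_{S^c}\|_1 + \frac{1}{2} \|v_S\|_1 + \frac{1}{2} \|v_{S^c}\|_1 \le \frac{1}{2} \left( 5 \|v_S\|_1 - 3 \|v_{S^c}\|_1 \right). \tag{79}$$

We now give a lower bound on the LHS of (76), applying the lower-RE condition as in Definition 2.2,

$$v^T \hat\Gamma v \ge \alpha \|v\|_2^2 - \tau \|v\|_1^2 \ge -\tau \|v\|_1^2,$$

thus

$$-v^T \hat\Gamma v \le \|v\|_1^2\, \tau \le \|v\|_1\, 2 b_0 \sqrt{d}\, \tau \le \|v\|_1\, 2 b_0 \frac{\psi}{b_0} \sqrt{\frac{\log m}{f}} = \|v\|_1\, 2 \psi \sqrt{\frac{\log m}{f}} \le \frac{1}{2} \lambda \left( \|v_S\|_1 + \|v_{S^c}\|_1 \right), \tag{80}$$

where we use the assumption that

$$\sqrt{d}\, \tau \le \frac{\psi}{b_0} \sqrt{\frac{\log m}{f}}, \quad \text{and} \quad \|v\|_1 \le \|\hat\beta\|_1 + \|\beta\|_1 \le 2 b_0 \sqrt{d},$$

which holds by the triangle inequality and the fact that both $\hat\beta$ and $\beta$ have $\ell_1$ norm bounded by $b_0 \sqrt{d}$.
Hence by (78) and (80),

$$0 \le -v^T \hat\Gamma v + \frac{5}{2} \lambda \|v_S\|_1 - \frac{3}{2} \lambda \|v_{S^c}\|_1 \tag{81}$$

$$\le \frac{1}{2} \lambda \|v_S\|_1 + \frac{1}{2} \lambda \|v_{S^c}\|_1 + \frac{5}{2} \lambda \|v_S\|_1 - \frac{3}{2} \lambda \|v_{S^c}\|_1 \le 3 \lambda \|v_S\|_1 - \lambda \|v_{S^c}\|_1.$$

Thus we have

$$\|v_{S^c}\|_1 \le 3 \|v_S\|_1. \tag{82}$$

Thus Lemma 30 holds. □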
Proof of Theorem 8. Following the conclusion of Lemma 30, we have

$$\|v\|_1 \le 4 \|v_S\|_1 \le 4 \sqrt{d}\, \|v\|_2. \tag{83}$$

Moreover, we have by the lower-RE condition as in Definition 2.2

$$v^T \hat\Gamma v \ge \alpha \|v\|_2^2 - \tau \|v\|_1^2 \ge (\alpha - 16 d \tau) \|v\|_2^2 \ge \frac{1}{2} \alpha \|v\|_2^2, \tag{84}$$

where the last inequality follows from the assumption that $16 d \tau \le \alpha/2$.

Combining the bounds in (84), (83) and (77), we have

$$\frac{1}{2} \alpha \|v\|_2^2 \le v^T \hat\Gamma v \le \lambda \left( 2 \|\beta\|_1 - 2 \|\hat\beta\|_1 + \frac{1}{2} \|v\|_1 \right) \le \frac{5}{2} \lambda \|v\|_1 \le 10 \lambda \sqrt{d}\, \|v\|_2.$$

And thus we have $\|v\|_2 \le 20 \lambda \sqrt{d} / \alpha$. The theorem is thus proved. □
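The norm comparison used at the start of the proof of Theorem 8 can be checked on a concrete vector. This sketch is illustrative only, with a hypothetical error vector and support: it tests the cone condition $\|v_{S^c}\|_1 \le 3\|v_S\|_1$ from Lemma 30 and then the chain $\|v\|_1 \le 4\|v_S\|_1 \le 4\sqrt{d}\,\|v\|_2$.

```python
import math


def l1(v):
    """Return the l1 norm of a vector."""
    return sum(abs(t) for t in v)


def l2(v):
    """Return the Euclidean norm of a vector."""
    return math.sqrt(sum(t * t for t in v))


def check_cone_chain(v, support):
    """Return (in_cone, chain_holds) for an error vector v and support S."""
    s = set(support)
    v_s = [t for i, t in enumerate(v) if i in s]
    v_sc = [t for i, t in enumerate(v) if i not in s]
    d = len(s)
    in_cone = l1(v_sc) <= 3 * l1(v_s)
    chain_holds = l1(v) <= 4 * l1(v_s) <= 4 * math.sqrt(d) * l2(v)
    return in_cone, chain_holds


# Hypothetical error vector with most of its mass on S = {0, 1}.
v = [1.0, -0.5, 0.2, 0.1, -0.3, 0.05]
in_cone, chain_holds = check_cone_chain(v, support=[0, 1])
print(in_cone, chain_holds)
```

The second inequality in the chain is Cauchy-Schwarz on the $d$ coordinates in $S$, which is where the factor $\sqrt{d}$ in the final rate comes from.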


D.1 Proof of Lemma 9


In view of Remark D.l, Condition (40) implies that (52) in Theorem 20 holds for ( = sq and e = 

Now, by Theorem 20, we have Vu, v € E D under (Al) and (A3), condition (48) holds under event 

.Ao, and so long as mf > 1024Cg772iF^ log m/Amin(^)^> 


\u^Av\ < 8Cw(so)s + 2CoD2K‘^ir-^^^ 

mf 


=: 6 with 6 < ^Amin(2l) < ^ 
o o 


1 AjxLin(A) 

which holds for all e < - - 


1 1 

< 


2 640^(50) ■ 2Ma ~ 128C' 
withP(A.o) > 1 — 4 exp —2 exp (^—C 2 e'^^'^ —6jm^. Hence, by Corollary 19, V0 G R™, 

9'^fA0>a\\9\\l-T\\e\\l and 9'^f a 9 < a \\9\\l + t \\9\\l 
where a = ^ Amin (A) and a = |Amax(A) and 

512C^ri7(so)^ log m a 2a 

< T = - < 


< 


Amin (A) f ' So So + 1 

1024(7^^^(30 + 1) logm 


Amin(A) / 

where we plugged in $s_0$ as defined in (12). The lemma is thus proved in view of Remark D.1. □
Remark D.1. Clearly the condition on $\mathrm{tr}(B) / \|B\|_2$ stated in Lemma 9 ensures that we have for $\varepsilon$


and hence 


,2 tr(^) 

K^\\B\ 


> 


> 


e 


log 


3em 

SqE 


1 


Ac'K^M'fso log 




6 emMj^ ^ 


■So 


> c'so log ( 


f GemMA'^ 


V ■So 


exp 



tr(-B) ^ 
^^11^112; 


< exp ( -c'c 2 So log 

4/ 


f 6 emMA\ 


[- 


So 


)) 


exp 


-C3 


M\ log m 


log 


3eM^m log m 

2 / 


D.2 Proof of Lemma 10 


Let 


M+ = + 4) + 1) = Pmax(s0 + 1, -4) + Tfi =: D 


By definition of so^ we have 

Vso + lw{so + 1) > 

So + 1 > 


Amin(^) / / 


and hence 


32C y logm 

AL..(-4) / (- a y / ^ 1_ L 

1024(72-072(50 + 1) logm \16CL>/ logm log 


m 


The first inequality in (33) holds given that M+ < 2Ma and hence 

1 / 


d < 


< 1 / < So + 1 ^ £0 

^ ' 64-32 


64M^ log m 16Ml log m 


Moreover, for D = pmax(so + 1, ^) + tb < -D 2 and C = Cojs/d, we have 


, ^ r. >rs f ^ 1 (CoD2\^^ 

d _ \CD ) 


f 


log m 


< - 


1 / 1 


2 V 16CD 




f 


M\ log m 


< 


1 (so + l/\ogm L /\ ^ (so)^ logm ( / 

2 a2 / V^o ' 


o;2 / y^o 


1 

2AfA 
where assuming that $s_0 \ge 3$, we have

^ mV 

o? “ \ a / “ {IQCDY Vlog"i/ 

= AClDl^{M, + Km\^f (85) 

> AClDlD^ = ACiDl • 


We have shown that (42) indeed holds, and the lemma is thus proved. □ 

Remark D.2. Throughout this paper, we assume that $C_0$ is a large enough constant such that for $c$ as defined
in Theorem 26,

$$c \min\{C_0^2, C_0\} \ge 4. \tag{86}$$


By definition of sq, we have for w‘^{so) > 1, 


. ( A) f 

soTU^(so) < . ;- and hence 


So S 


1024(70 logm 


1024(7q logm 1024 Cq logm 
Remark D.3. The proof shows that one can take C = Colsfd, and take 


=■■ So- 


V = 3eMl/2 = 


Hence a sufficient condition on r{B) is: 


3e64^C^w^{so) ^ 3 e 64 ^( 7 QtJ 7 ^(so) 


2Ai,.(/l) - 2(c')3/ni,.(.4)^ 


r{B) > IQc'K 


•' ^ ^3 log + log ^ 


logm 




2 / 


(87) 


It remains to prove Lemmas 14 and 15. 


Proof oi Lemma 14. Suppose that event Bq holds. By (74) and that fact that 2Di := 2( 




IISII 


^) < 2 {\\A\\\/^ + ||5||2^")(V^+ v^) < 77oracie77o, where recall D' = \\B\\IJ^ 

1 „ „ 


|V2^ 


|V2 , ^1/2 




< Z?'iLrV||n|2r^j + 2L»iiL- 

oo \/Tn 


, ^m,f 


< DoK\ 


\2 'm,f \ 


1/2 Tloracle 1 


fn + 


m / 


+ DoM^rm,f 


< ^o(rr + 

The lemma is thus proved. □ 


1/2 , D oracle 


m J 




+ 
Proof of Lemma 15. Recall that we require


/ 


I ^ 119 


d < Ca A 2| ^ ^ where 


where = 


logm 

l2^ 


P)2 

I 2 > #0- 


The proof for d < so/32 follows exactly that of Lemma 10. In order to show the second inequality, we 
follow the same line of arguments where we need to replace one inequality. By definition of D'q, we have 
II.BII 2 + Umax < {D'of < 2(||P||2 + Umax)- NoW SUppOSC that for 

, _ >r. f f f CoD'oV r. 

d .— Cac C(f)- -< Ca-, -I „ „ 1 Dfj) 

logm logm \ CL> J 

where 1 < D = pmax(so +tb < D 2 and C = CqI^TC. 

, ^ ^ . 1 fCoD'o\\ f 

^logm 128M^ \ CD ) ^ 


logm 


1 / 1 
< - 


2 V 16CD 




m 


< 


1 (so + 1)^ log m f 'll} \ (sq)^ log m f 'll} 


2 a2 


/ V&o 


< 2 

r\j^ 


o? f \bo 


where assuming that sq > 3, we have the following inequality by definition of sq and a = Amin(^)/2 


2s^ 


> 


a 


2 — 


So + 1 


a 


> 


a 


f 


{IQCDY vlognr 

which is identical in the proof of Lemma 10, while we replace (85) with 


^C'^iD'^YD^ = ACl{D^,Y{^^+rYKU) 


< 4Cl{D',Y^{M, + Tp^K\\n\2) 


2 1 - 


where := 


> := ^ and Y = 2Co ^ ||^*||^ ^ DoM,K^ as in (44). □ 


E Proofs for the Conic Programming estimator 

E.1 Proof of Lemmas 11 and 13

We next provide proofs for Lemmas 11 and 13 in this section. 
Proof of Lemma 11. Suppose event $\mathcal{B}_0$ holds. Then by the proof of Lemma 6,

$$\left\| \hat\gamma - \hat\Gamma \beta^* \right\|_\infty \le 2 C_0 D_2 K^2 \|\beta^*\|_2 \sqrt{\frac{\log m}{f}} + C_0 D_0 K M_\epsilon \sqrt{\frac{\log m}{f}} =: \mu \|\beta^*\|_2 + \tau.$$

The lemma follows immediately for the chosen $\mu, \tau$ as in (43) given that $(\beta^*, \|\beta^*\|_2) \in T$. □
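Proof of Lemma 12. By optimality of $(\hat\beta, \hat t)$, we have

$$\|\hat\beta\|_1 + \lambda \hat t \le \|\beta^*\|_1 + \lambda \|\beta^*\|_2.$$

Thus we have for $S := \mathrm{supp}(\beta^*)$,

$$\|\hat\beta\|_1 = \|\hat\beta_{S^c}\|_1 + \|\hat\beta_S\|_1 \le \|\beta^*\|_1 + \lambda \left( \|\beta^*\|_2 - \hat t \right).$$

Now by the triangle inequality,

$$\|\hat\beta_{S^c}\|_1 = \|v_{S^c}\|_1 \le \|\beta^*_S\|_1 - \|\hat\beta_S\|_1 + \lambda \left( \|\beta^*\|_2 - \hat t \right)$$

$$\le \|v_S\|_1 + \lambda \left( \|\beta^*\|_2 - \hat t \right) \le \|v_S\|_1 + \lambda \left( \|\beta^*\|_2 - \|\hat\beta\|_2 \right)$$

$$\le \|v_S\|_1 + \lambda \left( \|\beta^*\|_2 - \|\hat\beta_S\|_2 \right) \le \|v_S\|_1 + \lambda \|v_S\|_2 \le (1 + \lambda) \|v_S\|_1.$$

The lemma thus holds given

$$\hat t \le \frac{1}{\lambda} \left( \|\beta^*\|_1 - \|\hat\beta\|_1 \right) + \|\beta^*\|_2 \le \frac{1}{\lambda} \|v\|_1 + \|\beta^*\|_2. \; □$$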


Proof of Lemma 13. Recall the following shorthand notation:

$D_0 = \sqrt{\tau_B} + a_{\max}^{1/2}$ and $D_2 = 2 (\|A\|_2 + \|B\|_2)$.

First we rewrite an upper bound for u = /3 — /3*, D = tr{B) and D = tr(i3) 

= (X-WfXoiP-P*) < X^Xo{p-P*) +\\W^Xov\\ 


< 


+ 


X^{XP-y)-Dp +||X^e|| + {X^W - D)P 

00 ^ 

{D-D)P +||lL^Xor;||^ 
where


x^xo(/3 - r) 


< 


< 


X^{Xop -y + e) 


X^{{X-W)d-y) +\\X^ 

oo 

X^{XP-y)-Dp +\\X^. 


+ 


{X^W-D)P 


OO 

+ 


On event Bq, we have by Lemma 12 and the fact that P £ T 


I := 


1-W = jX^iy - XP) + \DP 

< Ill'll! + II/ 5 II 2 ) +7' 


[D - D)P 


< yt + T 


and on event 


= 2 D 2 Krraj{^ IMIi + 11/3*112) + DoVmjM^ 


< rmjMPalll^ + y/TE) = D^r^jM^ 

Thus on event Bq, we have 

I + II < 2D2Krmj{^ ||u||i + 11 / 3 * 112 ) + 2DormjM^ = n{{j ||ti||^ + \\P*\\ 2 ) + 2r. 
Now on event Be, we have for 2Di < D 2 


IV := 


(D - D)P 


< 


D-D 


P 


00 \/Tn 


< D2K—rrnj{\\P*\\2 + \\v\\P 


On event B^ 0 ^ 10 , we have 


III := 4 


(X^W-DW < },\\{xnv - D)e"\\^ +!j\\{X'^W - D) 


BZ - tr(B)/„)||„„ + ) IIXj-»F|L, 

+ v?^»;/i)(iMii + iirii2) 


+ 7 


< fmjK 


^/7 


and y = i \\W^Xov\\^ < j ll^^lli < r^jKy/^aUl ||?^||i • 

Thus we have on Bq n Bio, for Do < D 2 and r^i = 1 

III + IV + v < TmjK ^||i ?||2 + Tb + Umax + '^^(11^112 + I|f3|l2)^ (Il'T’lli + ||/3*||2) 

< r^jK{4\\B\\2 + 3\\A\\2)i\\v\\, + \\P*\\2) 

< 2I)2i^r^,/(|klli + ll/3*ll2) 

< Mllull,+ 11 / 3 * 112 ) 
Thus we have


jXjXov 


OO 


< I + II + III + IV + V 

< fJ-ij ll^^lli + II/ 5 II 2 ) + 2DoM^rmj + niWvWi 

^ 2/x 11/3*112 +/^(—+ 1) ||u||]^ + 2 t. 


The lemma thus holds. □ 


2 ) 


F Proof for Theorem 5 


We prove Lemmas 16 to 18 in this section. 


Proof oi Lemma 16. Suppose event Bq holds. Then by the proof of Lemma 14, we have for Dq = 
+ aUL and where Dorade = 2 (||.B|| 2 '^^ + 


< D'QTp^Krmj 11 / 3*112 + DoM.rmj. 


7-r/3 

The lemma follows immediately for /r, r as chosen in (45). □ 


Proo/of Lemma 17. We first show (46) and (47). Recall Vm^m ■= > 2Co *°^J^ ™' . By 

Lemma 7, we have on event Bq, 


\tb-tb\ < DiK^rr, 


Moreover, we have under (Al) 1 = < Z?i := 


+ 


ml/2 /1/2 


in view of (75). And 


Di < 


+ 11-^112 ^ ( 


3^oracle \ 2 


and hence 


V^i< 


-^oracle 


= \\B 


11/2 


+ 


1/2 

2 


By definition and construction, we have tb,tb > 0, 


^1/2 1/2 


B 


^1/21/2. 


and hence 


Thus, on event Bq, we have 


d/2 1/2 


< 




= \TB — Tb\ 


d/2 1/2 


< V\^b-tb\ < 


Thus we have for Ce > P>orade > and Z^orade = 2(|| 


1/2 Z^orade 1/2 ^ 1/2 ^ ^1/2 Z/orade 1/2 

D-^—Zfr^/^ < Tjf < Tjf + —^ 


'B 


( 88 ) 
+/2


Thus we have for as defined in (23), (88) and the fact that 

m •= \/2Co ^ — > Ij^pm for m > 16 and Cq > 1, 

’ - / mr> 


the following inequalities hold: for K >1, 


(89) 


< rr + 


1/2 -Ppracle , -Poracle ^l/2 


< + Dor^ieKrlll^ < 


mm — ' B 


—1/2 -^1/2 1/2 

where the last inequality holds by the choice of + Dora.cieKrmm as in (30). Moreover, we have 

on event Bq, by (88) 

:^V2 


^ := ry^ + C,KrUl<Tl^^ + ^^ Krill+ CeKrUl 


< rH^ + lc^Krlll 


tb 


‘B • 2' 

:= [r]/^ + C,Krlllf <2 tb + 2CIk\^ 

"y2 

' mm 


< 2 tb + ‘^DiK‘^rm,m + ‘^CaK'^rn 


< 2tb + 


D. 


oracle 


K^fm,m + ^CgK'^rmm ^ 2 tb + SC^K'^r, 


2e-2„ 


and thus (46) and (47) hold given that 2Di < Dl^^^/2 < (71/2. Finally, we have 

1/2 3^ „ 1/2 

~l/2 - < / 1/2 2(-„^^1/2 n - < +2<76Al-mm 

for as defined in (26). □ 

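Remark F.1. The set $T$ in our setting is equivalent to the following: for $\mu, \tau$ as defined in (30) and $\beta \in \mathbb{R}^m$,

$$T = \left\{ (\beta, t) : \left\| \frac{1}{f} X^T (y - X \beta) + \frac{1}{f} \widehat{\mathrm{tr}}(B) \beta \right\|_\infty \le \mu t + \tau, \; \|\beta\|_2 \le t \right\}. \tag{90}$$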


Proof of Lemma 18. For the rest of the proof, we will follow the notation in the proof for Lemma 13. 
Notice that the bounds as stated in Lemma 12 remain true with r, fi chosen as in (45), so long as (/3*, ||/3* II 2 ) G 
T. This indeed holds by Lemma 16: for r (29) and fi (30) as chosen in Theorem 5, we have by (89), 

fi X D'fh]l‘^Krmj > D'oKrmjTp'^ 

where = {y/rs + )» which ensures that (/3*, ||/3*||2) G T by Lemma 16. 

On event Bq, we have by Lemma 12 and the fact that ^8 G T as in (90) 


I + II := 
< 


7-f^ +^\\X^€\\ 

^XT{y-XP) + jDP 


+ r < /it + 2t 


— lliilli + 11 / 5 * 112 ) + 2t 
for $\mu, \tau$ as chosen in (30) and (29) respectively. Now on event $\mathcal{B}_6$, we have


IV := 


{D - D)f5 


< 


< D'o 


D-D 

I .f^oracle 


1 


+ Halloo) 


/3 


Krm,f{\\/3*\\2 + ll^^lli) 


where 2 Di < £>oracie^o for 1 < D'^ := + aUL and Dorade = 2 (||.B|| 2 '‘ + 

|V2^^1/2 


11/2 


Umax > Tyi = 1 Under (Al). Hence 

III + IV + V< rmjK^ (||H||f + a]lL) (||Hli + W II2) 

+2DiK-^rmj{\\l3*\\2 + \\v\\i) + rmjKy/TEa]ll^ ||u||i 
Vm 

< D'^Kr^j{\\v\\^ + \\ 

< D'QKr^jT'^/‘^{\\v\\^ + 11/3*112) + £>o^r^jv^lblli 


2)(0^ + ||u||i 


m 


< CoD'^K^-^ir^J^ + ^^){2 llulli + ||/3*||2) 


< M2iiiiiii + iirii2) 

for fjL as defined in (30) in view of (89). Thus we have 

I + II + III + IV + V < /x(i||u||i + 11 / 3 * 112 )+ 2r + /i(2||u||i + 

A 


— 2/x((l + —) llullj^ + 

and the improved bounds as stated in the Lemma thus holds, n 


2 ) + 2t 


1/2 

2 


where 


G Some geometric analysis results 

Let us define the following set of vectors in $\mathbb{R}^m$:

$$\mathrm{Cone}(s_0) := \{ v : \|v\|_1 \le \sqrt{s_0}\, \|v\|_2 \}.$$

For each vector $x \in \mathbb{R}^m$, let $T_0$ denote the locations of the $s_0$ largest coefficients of $x$ in absolute values.
Any vector $x \in S^{m-1}$ satisfies:

$$\|x_{T_0^c}\|_\infty \le \|x_{T_0}\|_1 / s_0 \le \|x_{T_0}\|_2 / \sqrt{s_0}. \tag{91}$$

We need to state the following result from Mendelson et al. (2008). Let $S^{m-1}$ be the unit sphere in $\mathbb{R}^m$; for
$1 \le s \le m$,

$$U_s := \{ x \in \mathbb{R}^m : |\mathrm{supp}(x)| \le s \}. \tag{92}$$

The set $U_s$ is a union of the $s$-sparse vectors. The following three lemmas are well-known and mostly
standard; see Mendelson et al. (2008) and Loh and Wainwright (2012).
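Inequality (91) can be verified directly for any given vector. The sketch below is an illustration with a hypothetical input: it selects the $s_0$ largest coordinates in absolute value and compares the largest remaining coordinate with the average and the root mean square of the selected ones.

```python
import math


def top_support(x, s0):
    """Return the index set T_0 of the s0 largest entries of x in absolute value."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    return set(order[:s0])


def tail_bound_terms(x, s0):
    """Return the three quantities compared in (91).

    lhs = ||x_{T0^c}||_inf, mid = ||x_{T0}||_1 / s0,
    rhs = ||x_{T0}||_2 / sqrt(s0).
    """
    t0 = top_support(x, s0)
    head = [abs(x[i]) for i in t0]
    tail = [abs(x[i]) for i in range(len(x)) if i not in t0]
    lhs = max(tail) if tail else 0.0
    mid = sum(head) / s0
    rhs = math.sqrt(sum(h * h for h in head)) / math.sqrt(s0)
    return lhs, mid, rhs


# Hypothetical vector, normalized to the unit sphere for illustration.
raw = [3.0, -2.0, 1.5, 0.5, -0.4, 0.3, 0.1]
norm = math.sqrt(sum(t * t for t in raw))
x = [t / norm for t in raw]
lhs, mid, rhs = tail_bound_terms(x, s0=3)
print(f"{lhs:.4f} <= {mid:.4f} <= {rhs:.4f}")
```

The first inequality holds because every coordinate outside $T_0$ is at most the smallest coordinate inside $T_0$, and the second is the comparison between the mean and the root mean square.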
Lemma 31. For every $1 \le s_0 \le m$ and every $I \subset \{1, \ldots, m\}$ with $|I| \le s_0$,

p ^m-i ^ 2 conv {Uso n 5™-^) =; 2conv I |J n j 

\\J\<so / 

and moreover, for p E (0,1]. 

n pBif C (1 + p) conv [Us, n Bff) =: (1 + p) conv | |J n 


Proof. Fix X € R"^. Let xtq denote the subvector of x confined to the locations of its sq largest coefficients 
in absolute values; moreover, we use it to represent its 0-extended version x' G such that xf^ = 0 and 
x'j,^ = xtq- Thr'oughout this proof, Tq is understood to be the locations of the sq largest coefficients in 
absolute values in x. 


Moreover, let (0^1 be non-increasing rearrangement of i\xi\)r=v Denote by 


L = ^B^npBf^ 

R = 2conv I y I = 2conv (^nR^) 

Any vector x G R™ satisfies: 

- II^Tblli/■So < (93) 

If follows fhaf for any p > 0, sq > 1 and for all z G L, we have fhe largesf coordinafe in absolufe value 
in z is af mosf / i, 


sup {x,z) < max ( xtq,z) + max ( xt§ ,2) 

zeL l|2|l2<P lklli<\Ao 

< pI|2;toII2 + 

< \\xTo\\2iP + ^) 

where clearly max||^||, 2 <p{^To,z) = And denote by S'^ := S'^ ^FEj, 


sup {x,z) 

z&R 


(1 + p) max maxfxjz) 
J:|J|<S0 2eS-^ 

(l + p)||xroll2 


given that for a convex function $\langle x, z \rangle$, the maximum happens at an extreme point, and in this case, it
happens for $z$ such that $z$ is supported on $T_0$, such that $z_{T_0} = \frac{x_{T_0}}{\|x_{T_0}\|_2}$, and $z_{T_0^c} = 0$. □


0 Il2 


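Lemma 32. Let $1/5 > \delta > 0$. Let $E = \bigcup_{|J| \le s_0} E_J$ for $0 < s_0 < m/2$ and $k_0 > 0$. Let $\Delta$ be an $m \times m$
matrix such that

$$|u^T \Delta v| \le \delta \quad \forall u, v \in E \cap S^{m-1}. \tag{94}$$

Then for all $v \in (\sqrt{s_0} B_1^m \cap B_2^m)$, we have

$$|v^T \Delta v| \le 4 \delta. \tag{95}$$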
Proof. First notice that


max \v 


T 


Ar;| < 


max 


to,ne nB^) 


\w 


Au\ 


(96) 


Now that we have decoupled u and w on the RHS of (96), we first fix u. Then for any fixed u G S'™ ^ and 
mafrix A G R™^™, f{w) = |t(;^An| is a convex function of w, and hence for w G H B™) C 


2conv(U|,,|<,„BjnS™-i), 


2 max Irc^Aul 

T^Econv {EnS'^~^) 

2 max Im^Aril 

weEnS’^-'^ 

where fhe maximum occurs af an exfreme poinf of fhe sef conv (B H 5*™“^), because of fhe convexify of fhe 
function f{w). 

Clearly fhe RHS of (96) is bounded by 

max |m^An| = max max |u)^Au| 

u,we[^BY‘nB^) ue[^BY^nB^) nsy*) 

< 2 max max |r(;^AM| 

= 2 max g{u) 

ue{^BY^nB^) 

where fhe function 5 of rt G (y/soB™ n B™) is defined as 

g{u) = max |t(;^Au| 


max '^Au\ < 

we{y^BY^nB^) 


which is convex since if is fhe maximum of a function fw{u) ■= |m^Au| which is convex in u for 
each w £ {E n S™“^). Thus we have for u G (y/ioB™ n B™) C 2 conv (^U|j|<so =: 

2 conv (BnS'™-^) 


max g{u) 


< 2 max g{u) 

uSconv 


= 2 max g(u) (97) 

ueEnS"*-i 

= 2 max max Itc^Artl < 46 (98) 

liSEnS"*-! 


where (97) holds given fhaf fhe maximum occurs af an exfreme poinf of fhe sef conv {E n B™), because of 
fhe convexify of fhe function 5 r(u). n 

Corollary 33. Suppose all conditions in Lemma 32 hold. Then $\forall v \in \mathrm{Cone}(s_0)$,

$$|v^T \Delta v| \le 4 \delta \|v\|_2^2. \tag{99}$$

Proof. It is sufficient to show that $\forall v \in \mathrm{Cone}(s_0) \cap S^{m-1}$,

$$|v^T \Delta v| \le 4 \delta.$$

Denote by $\mathrm{Cone} := \mathrm{Cone}(s_0)$. Clearly this set of vectors satisfies:

$$\mathrm{Cone} \cap S^{m-1} \subset (\sqrt{s_0} B_1^m \cap B_2^m).$$

Thus (99) follows from (95). □

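Remark G.1. Suppose we relax the definition of $\mathrm{Cone}(s_0)$ to be:

$$\mathrm{Cone}(s_0) := \{ v : \|v\|_1 \le 2 \sqrt{s_0}\, \|v\|_2 \}.$$

Clearly, $\mathrm{Cone}(s_0, 1) \subset \mathrm{Cone}(s_0)$, given that $\forall u \in \mathrm{Cone}(s_0, 1)$, we have

$$\|u\|_1 \le 2 \|u_{T_0}\|_1 \le 2 \sqrt{s_0}\, \|u_{T_0}\|_2 \le 2 \sqrt{s_0}\, \|u\|_2.$$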
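Lemma 34. Suppose all conditions in Lemma 32 hold. Then for all $v \in \mathbb{R}^m$,

$$|v^T \Delta v| \le 4 \delta \left( \|v\|_2^2 + \frac{1}{s_0} \|v\|_1^2 \right). \tag{100}$$

Proof. The lemma follows given that $\forall v \in \mathbb{R}^m$, one of the following must hold:

$$\text{if } v \in \mathrm{Cone}(s_0): \quad |v^T \Delta v| \le 4 \delta \|v\|_2^2; \tag{101}$$

$$\text{otherwise}: \quad |v^T \Delta v| \le \frac{4 \delta}{s_0} \|v\|_1^2, \tag{102}$$

leading to the same conclusion in (100). We have shown (101) in Lemma 32. Let $\mathrm{Cone}(s_0)^c$ be the
complement set of $\mathrm{Cone}(s_0)$ in $\mathbb{R}^m$. That is, we focus now on the set of vectors such that

$$\mathrm{Cone}(s_0)^c := \{ v : \|v\|_1 \ge \sqrt{s_0}\, \|v\|_2 \}$$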

and show that for u = 



where the last inequality holds by Lemma 32 given that 



and thus 



1 

— sup 
■*0 



□ 
H Proof of Corollary 19


First we show that for all v G R”^, (103) holds. It is sufficient to check that the condition (94) in 
Lemma 32 holds. Then, (103) follows from Lemma 34: for v G R™, 

\v^Av\ < 4 ( 5 (||r ;||2 + ^ ||r;||i) < ^Amin(A)(||r ;||2 + ^ ||r;||i). (103) 

The Lower and Upper RE conditions thus immediately follow. The Corollary is thus proved. □ 


I Proof of Theorem 20 

We first state the following preliminary results in Lemmas 35 and 36; their proofs appear in Section K.
Throughout this section, the choice of $C = C_0 / \sqrt{c'}$ satisfies the conditions on $C$ in Lemmas 35 and 36,
where recall $\min\{C_0, C_0^2\} \ge 4/c$ for $c$ as defined in Theorem 26. For a set $J \subset \{1, \ldots, m\}$, denote
$F_J = A^{1/2} E_J$, where recall $E_J = \mathrm{span}\{e_j : j \in J\}$.

Lemma 35. Suppose all conditions in Theorem 20 hold. Let 

E= [J EjnS'^-^ 

\j\=k 


Suppose that for some d > 0 and £ < where C = , 


(104) 


Then for all vectors u,v ^ E Ci on event Bi, where P (. 81 ) >1 — 2 exp C 2 e^ for C 2 > 2, 

\u^Z'^BZv-Eu^Z'^BZv\ < 4Cetr(R). 

Lemma 36. Suppose that e < 1/C, where C is as defined in Lemma 35. Suppose that (104) holds. Let 

FJ = Ej and E= Fj. (105) 

\J\=k \J\=k 


Then on event B2, where P {B2) > 1 — 2 exp C2g^ j^^4||g|| ^ for C2 > 2, we have for all vectors u G 
EnS^-^ andw G F n 


uFzIb^/‘^Z2U 


- , f I1/2 - 


where Z \, Z 2 are independent copies of Z, as defined in Theorem 20. 

In facf, fhe same conclusion holds for all y, m G F n and in parficular, for B = I, we have fhe 

following. 
Corollary 37. Suppose all conditions in Lemma 35 hold. Suppose that $F = A^{1/2} E$ for $E$ as defined in

Lemma 35. Let 




(106) 


Then on event B 3 , where P {Bf) >1 — 2 exp (—C 2 e^/-^), we have for all vectors w,y E E Ci 5™ ^ and 
£ < IjC for C is as defined in Lemma 35, 


y'^i^Z^Z - I)w 


< ACe. 


(107) 


We prove Lemmas 35 and 36 and Corollary 37 in Section K. We are now ready to prove Theorem 20. 
Proof oi Theorem 20. Recall the following for Xq = ZiA^^'^, 

A:=fA-A:= jX'^X - j:tj:{B)L^ - A 
= (jx;[Xo -A) + ^{W^Xo + XjW) + ){W^W - tT{B)L^). 

Notice that 

w^{TA — 3i)v = \u^ {X'^ X — ii{B)Im — 3i)v\ 


< 

< 


u 

u 

+ 


^{^X^Xo-A)v + u^^W^Xo + XjW)v + u^i^W^W -^-^L^)v 
^A^/^^ZfZiA^/^v-u^Av + u^j{W^Xo + X^W)v 

{jZjBZ 2 - TBlm)v + j |te(.B) - tr(5)| \u^v\ =:/ + // + LLL + TV. 


.T/l 


For u € E n S'^ define h{u) := 


||Al/ 2 i 


-. The conditions in (104) and (106) hold for k. We first bound 


the middle term as follows. Fix u,v ^ E Ci ^ Then on event B 2 , for T = ZT B^I’^Z 2 , 


\u^{W^Xq +X^W)v\ = 

u^Z'^B^/'^ZiA^/^v + u^A^I‘^zIb^I‘^Z 2 V 

< 

u^T^h{v)\ 

A^/\ 

^ + \h{u)'^Tv\ 

A^/^u 


< 2 max 


< 


8Crtr(i7) • 


w^'^'^\pUL{k,A) 

1/2 


We now use Lemma 35 to bound both / and ILL. We have for C as defined in Lemma 35, on event i3i n 1 ^ 3 , 

\u^{Z^BZ 2 - tT{B)Lm)v\ < ACetriB). 

Moreover, by Corollary 37, we have on event B 3 , for all u, n G E D 5™“^, 


u 


T (1 


- Afi 


u 


^A^/'^Z'^ZA^/'^v - u^Av 


h{uf{^Z^Z - I)h{v) 


A^l‘^1 


A^/\ 


< j max^ \uF{Z"^Z - I)y\ pmax(fe, A) 

Fl A:C £ Pj^ny^(k , A3j . 
Thus we have on event $\mathcal{B}_1 \cap \mathcal{B}_2 \cap \mathcal{B}_3$ and for $\tau_B := \mathrm{tr}(B)/f$


/ + // + III 


^ Ar< ( n /t\ I o f Praa.x{k,A)^ ^ ^ 

< ACe I Pma.x{k, A) + 2tb ^—|j^g|j- j + Tb 

< 8C£{tb + Pm<,x{k,A)) . 
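One way to see the last inequality in the bound on $I + II + III$ is an arithmetic-geometric mean step. The sketch below is a reconstruction of an omitted step rather than text from the original proof; it assumes that $B$ is an $f \times f$ positive semidefinite matrix, so that $\tau_B = \mathrm{tr}(B)/f \le \|B\|_2$.

```latex
\begin{align*}
2\tau_B \frac{\rho_{\max}^{1/2}(k,A)}{\|B\|_2^{1/2}}
  &= 2\,\tau_B^{1/2}\rho_{\max}^{1/2}(k,A)\left(\frac{\tau_B}{\|B\|_2}\right)^{1/2}
   \le 2\,\tau_B^{1/2}\rho_{\max}^{1/2}(k,A)
   \le \tau_B + \rho_{\max}(k,A), \\
\text{so}\quad
4C\varepsilon\left(\rho_{\max}(k,A) + 2\tau_B \frac{\rho_{\max}^{1/2}(k,A)}{\|B\|_2^{1/2}} + \tau_B\right)
  &\le 8C\varepsilon\left(\tau_B + \rho_{\max}(k,A)\right).
\end{align*}
```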


On event Bq, we have for Di as defined in Lemma 7, 

$$IV \le \left|\widehat{\tau}_B - \tau_B\right| \le 2C_0 D_1 K^2 \sqrt{\frac{\log m}{fm}}.$$

The theorem thus holds by the union bound. □
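The three-term split of $\Delta$ used at the start of this proof is a purely algebraic identity, so it can be sanity-checked numerically. The following is a minimal NumPy sketch under stated assumptions: the true $\mathrm{tr}(B)$ is used in place of the estimator $\widehat{\mathrm{tr}}(B)$ (so the term $IV$ vanishes), $Z_1, Z_2$ have Gaussian entries, and $A, B$ are AR(1)-type covariances. The helper names `psd_sqrt`, `delta_terms` and `ar1` are illustrative and do not come from the paper.

```python
import numpy as np

def psd_sqrt(M):
    """Symmetric square root of a positive semidefinite matrix (eigendecomposition)."""
    vals, vecs = np.linalg.eigh(M)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def delta_terms(Z1, Z2, A, B):
    """Return Delta and its three summands for X = Z1 A^{1/2} + B^{1/2} Z2.

    Assumption: the true tr(B) replaces the estimator, so the term IV is zero.
    """
    f, m = Z1.shape
    X0 = Z1 @ psd_sqrt(A)          # f x m signal part
    W = psd_sqrt(B) @ Z2           # f x m measurement error
    X = X0 + W
    tau_B = np.trace(B) / f
    delta = X.T @ X / f - tau_B * np.eye(m) - A
    signal = X0.T @ X0 / f - A
    cross = (W.T @ X0 + X0.T @ W) / f
    noise = W.T @ W / f - tau_B * np.eye(m)
    return delta, signal, cross, noise

def ar1(n, rho):
    """AR(1) covariance matrix with entries rho^{|i-j|} (an illustrative choice)."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])

rng = np.random.default_rng(0)
f, m = 40, 12
A, B = ar1(m, 0.5), 0.3 * ar1(f, 0.2)
Z1, Z2 = rng.standard_normal((f, m)), rng.standard_normal((f, m))
delta, signal, cross, noise = delta_terms(Z1, Z2, A, B)
print(np.allclose(delta, signal + cross + noise))
```

The check only confirms the algebra of the decomposition; the probabilistic bounds on $I$, $II$ and $III$ depend on the constants in Lemmas 35 and 36 and are not tested here.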


J Proof for Theorem 21 


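We first state the following bounds in (108) before we prove Theorem 21. On event $\mathcal{A}_2$, where $P(\mathcal{A}_2) \ge 1 - 2\exp\left(-c_3\varepsilon^2\frac{\mathrm{tr}(A)}{K^4\|A\|_2}\right)$,

$$\forall u, w \in S^{f-1}, \quad \left|u^T Z_1 A^{1/2} Z_2^T w\right| \le \frac{4C\varepsilon\,\mathrm{tr}(A)}{\|A\|_2^{1/2}}. \tag{108}$$

To see this, first note that by Lemma 27, we have for $t = C\varepsilon\,\mathrm{tr}(A)/\|A\|_2^{1/2}$ and $\varepsilon \le 1/2$,

$$\begin{aligned}
P\left(\left|u^T Z_1 A^{1/2} Z_2^T w\right| > t\right) &\le 2\exp\left(-c\min\left(\frac{C^2\varepsilon^2\,\mathrm{tr}(A)}{K^4\|A\|_2},\ \frac{C\varepsilon\,\mathrm{tr}(A)}{K^2\|A\|_2}\right)\right) \\
&\le 2\exp\left(-c\min\left(C^2, 2C\right)\frac{\varepsilon^2\,\mathrm{tr}(A)}{K^4\|A\|_2}\right),
\end{aligned}$$

where recall $C' = cc'\min(2C, C^2) > 4$.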


Before we proceed, we state the following well-known result on volumetric estimates; see e.g. Milman and Schechtman (1986).

Lemma 38. Given $m \ge 1$ and $\varepsilon > 0$. There exists an $\varepsilon$-net $\Pi \subset B_2^m$ of $B_2^m$ with respect to the Euclidean metric such that $B_2^m \subset (1 - \varepsilon)^{-1}\,\mathrm{conv}\,\Pi$ and $|\Pi| \le (1 + 2/\varepsilon)^m$. Similarly, there exists an $\varepsilon$-net of the sphere $S^{m-1}$, $\Pi' \subset S^{m-1}$, such that $|\Pi'| \le (1 + 2/\varepsilon)^m$.

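Choose an $\varepsilon$-net $\Pi \subset S^{f-1}$ such that $|\Pi| \le (1 + 2/\varepsilon)^f \le \exp(f\log(3/\varepsilon))$. The existence of such $\Pi$ is guaranteed by Lemma 38. By the union bound and Lemma 27, we have for some $C \ge 2$ and $c' \ge 1$ large enough,

$$P\left(\exists u, w \in \Pi \ \text{s.t.}\ \left|u^T Z_1 A^{1/2} Z_2^T w\right| > \frac{C\varepsilon\,\mathrm{tr}(A)}{\|A\|_2^{1/2}}\right) \le 2\exp\left(-c_3\varepsilon^2\frac{\mathrm{tr}(A)}{K^4\|A\|_2}\right).$$

Hence, (108) follows from a standard approximation argument.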
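The "standard approximation argument" behind (108) can be spelled out as follows. This is a sketch of the usual successive-approximation proof, not text taken from the paper; it assumes that $\Pi \subset S^{f-1}$ is an $\varepsilon$-net of the sphere with $\varepsilon \le 1/2$, and it writes $M := Z_1 A^{1/2} Z_2^T$, a shorthand introduced here.

```latex
% Every u in S^{f-1} can be expanded along the net:
%   u = \sum_{i \ge 0} a_i u_i, with u_i \in \Pi and 0 \le a_i \le \varepsilon^i,
% by approximating u, then the normalized residual, and so on.
% Likewise w = \sum_{j \ge 0} b_j w_j with w_j \in \Pi and 0 \le b_j \le \varepsilon^j.
\begin{align*}
|u^T M w|
  &= \Big|\sum_{i,j \ge 0} a_i b_j\, u_i^T M w_j\Big|
   \le \Big(\sum_{i \ge 0}\varepsilon^i\Big)\Big(\sum_{j \ge 0}\varepsilon^j\Big)
       \max_{u', w' \in \Pi} |u'^T M w'| \\
  &\le \frac{1}{(1-\varepsilon)^2}\cdot\frac{C\varepsilon\,\mathrm{tr}(A)}{\|A\|_2^{1/2}}
   \le \frac{4C\varepsilon\,\mathrm{tr}(A)}{\|A\|_2^{1/2}}
   \qquad (\varepsilon \le 1/2).
\end{align*}
```

The factor $(1-\varepsilon)^{-2} \le 4$ is what turns the constant $C$ on the net into the constant $4C$ in (108).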
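Lemma 39. Let $\varepsilon > 0$. Let $Z$ be as defined in Definition 1.2. Assume that

$$\frac{\mathrm{tr}(A)}{\|A\|_2} \ge c' K^4 f\,\frac{\log(3/\varepsilon)}{\varepsilon^2}.$$

Then

$$P\left(\exists x \in S^{f-1}:\ \left|\|A^{1/2} Z^T x\|_2 - (\mathrm{tr}(A))^{1/2}\right| > \varepsilon(\mathrm{tr}(A))^{1/2}\right) \le \exp\left(-c\,\varepsilon^2\frac{\mathrm{tr}(A)}{K^4\|A\|_2}\right).$$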


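Proof. Let $x \in S^{f-1}$. Then $Y = Z^T x \in \mathbb{R}^m$ is a random vector with independent coordinates satisfying $E Y_j = 0$ and $\|Y_j\|_{\psi_2} \le CK$ for all $j \in 1 \ldots m$. The last estimate follows from Hoeffding inequality. By Theorem 2.1 of Rudelson and Vershynin (2013),

$$P\left(\left|\|A^{1/2} Y\|_2 - (\mathrm{tr}(A))^{1/2}\right| > \varepsilon(\mathrm{tr}(A))^{1/2}\right) \le \exp\left(-c\,\varepsilon^2\frac{\mathrm{tr}(A)}{K^4\|A\|_2}\right).$$

Choose an $\varepsilon$-net $\Pi \subset S^{f-1}$ such that $|\Pi| \le (3/\varepsilon)^f$. By the union bound and the assumption of the Lemma,

$$\begin{aligned}
P\left(\exists x \in \Pi:\ \left|\|A^{1/2} Z^T x\|_2 - (\mathrm{tr}(A))^{1/2}\right| > \varepsilon(\mathrm{tr}(A))^{1/2}\right) &\le |\Pi| \cdot \exp\left(-c\,\varepsilon^2\frac{\mathrm{tr}(A)}{K^4\|A\|_2}\right) \\
&\le \exp\left(-c\,\varepsilon^2\frac{\mathrm{tr}(A)}{K^4\|A\|_2}\right),
\end{aligned}$$

where the constant $c$ may change from line to line.

A standard approximation argument shows that if $\left|\|A^{1/2} Z^T x\|_2 - (\mathrm{tr}(A))^{1/2}\right| \le \varepsilon(\mathrm{tr}(A))^{1/2}$ for all $x \in \Pi$, then $\left|\|A^{1/2} Z^T x\|_2 - (\mathrm{tr}(A))^{1/2}\right| \le 3\varepsilon(\mathrm{tr}(A))^{1/2}$ for all $x \in S^{f-1}$. This finishes the proof of the Lemma.

□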
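The concentration of $\|A^{1/2} Z^T x\|_2$ around $(\mathrm{tr}(A))^{1/2}$ in Lemma 39 can be illustrated with a small simulation. The sketch below is only an illustration under assumptions that are not from the paper: Gaussian entries stand in for the subgaussian ensemble $Z$, $A$ is an AR(1)-type covariance, and a handful of random unit vectors stand in for the points of an $\varepsilon$-net. The function name `relative_deviation` is hypothetical.

```python
import numpy as np

def relative_deviation(A, Z, x):
    """| ||A^{1/2} Z^T x||_2 - tr(A)^{1/2} | / tr(A)^{1/2} for a unit vector x.

    Uses ||A^{1/2} y||_2^2 = y^T A y, so no matrix square root is needed.
    """
    y = Z.T @ x
    root_trace = np.sqrt(np.trace(A))
    return abs(np.sqrt(y @ A @ y) - root_trace) / root_trace

rng = np.random.default_rng(1)
f, m = 5, 1000
idx = np.arange(m)
A = 0.5 ** np.abs(idx[:, None] - idx[None, :])   # illustrative AR(1) covariance
Z = rng.standard_normal((f, m))                  # Gaussian stand-in for subgaussian Z

# Random unit vectors stand in for points of an epsilon-net of S^{f-1}.
xs = rng.standard_normal((50, f))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)
worst = max(relative_deviation(A, Z, x) for x in xs)
print(f"largest relative deviation over 50 directions: {worst:.3f}")
```

In this regime $\mathrm{tr}(A)/\|A\|_2$ is large compared with $f$, which is the situation the assumption of the lemma describes, so the deviations are expected to be small uniformly over the sampled directions.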


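Proof of Theorem 21. First we write

$$\begin{aligned}
XX^T - \mathrm{tr}(A) I_f &= \left(Z_1 A^{1/2} + B^{1/2} Z_2\right)\left(Z_1 A^{1/2} + B^{1/2} Z_2\right)^T - \mathrm{tr}(A) I_f \\
&= \left(Z_1 A^{1/2} + B^{1/2} Z_2\right)\left(Z_2^T B^{1/2} + A^{1/2} Z_1^T\right) - \mathrm{tr}(A) I_f \\
&= Z_1 A^{1/2} Z_2^T B^{1/2} + B^{1/2} Z_2 Z_2^T B^{1/2} + B^{1/2} Z_2 A^{1/2} Z_1^T + Z_1 A Z_1^T - \mathrm{tr}(A) I_f.
\end{aligned}$$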


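Hence,

$$\begin{aligned}
\left|\tfrac{1}{m} u^T (XX^T) u - \tfrac{1}{m} u^T \mathrm{tr}(A) I u - u^T B u\right| &\le \left|\tfrac{1}{m} u^T Z_1 A Z_1^T u - \tfrac{\mathrm{tr}(A)}{m} u^T u\right| + \left|\tfrac{1}{m} u^T B^{1/2} Z_2 Z_2^T B^{1/2} u - u^T B u\right| \\
&\quad + \tfrac{2}{m}\left|u^T Z_1 A^{1/2} Z_2^T B^{1/2} u\right|,
\end{aligned}$$

where by (108), we have on event $\mathcal{A}_2$, for $\tau_A := \mathrm{tr}(A)/m$ and $w := B^{1/2}u/\|B^{1/2}u\|_2$,

$$\tfrac{2}{m}\left|u^T Z_1 A^{1/2} Z_2^T B^{1/2} u\right| = \tfrac{2}{m}\left|u^T Z_1 A^{1/2} Z_2^T w\right| \|B^{1/2}u\|_2 \le \frac{8C\varepsilon\,\mathrm{tr}(A)}{m} \frac{\|B^{1/2}u\|_2}{\|A\|_2^{1/2}} =: 8C\varepsilon\tau_A \frac{\|B^{1/2}u\|_2}{\|A\|_2^{1/2}}.$$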
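Moreover, by the union bound and Lemma 39, we have on event $\mathcal{A}_1$, where $P(\mathcal{A}_1) \ge 1 - \exp\left(-c\varepsilon^2 m/K^4\right) - \exp\left(-c\varepsilon^2\frac{\mathrm{tr}(A)}{K^4\|A\|_2}\right)$,

$$(1 - \varepsilon)\|B^{1/2}u\|_2 \le \tfrac{1}{\sqrt{m}}\|Z_2^T B^{1/2} u\|_2 \le (1 + \varepsilon)\|B^{1/2}u\|_2,$$

$$(1 - \varepsilon)\frac{(\mathrm{tr}(A))^{1/2}}{\sqrt{m}} \le \tfrac{1}{\sqrt{m}}\|A^{1/2} Z_1^T u\|_2 \le (1 + \varepsilon)\frac{(\mathrm{tr}(A))^{1/2}}{\sqrt{m}}.$$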


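Hence on event $\mathcal{A}_1$, we have

$$\left|\tfrac{1}{m}\|A^{1/2} Z_1^T u\|_2^2 - \tfrac{\mathrm{tr}(A)}{m}\right| \le \max\left((1 + \varepsilon)^2 - 1,\ 1 - (1 - \varepsilon)^2\right)\frac{\mathrm{tr}(A)}{m},$$

$$\left|\tfrac{1}{m}\|Z_2^T B^{1/2} u\|_2^2 - u^T B u\right| \le \max\left((1 + \varepsilon)^2 - 1,\ 1 - (1 - \varepsilon)^2\right)\|B^{1/2}u\|_2^2.$$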


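Thus we have for all $u \in S^{f-1}$, on event $\mathcal{A}_1 \cap \mathcal{A}_2$, for $C_2 := 4C + 3$,

$$\begin{aligned}
\left|\tfrac{1}{m} u^T (XX^T) u - \tfrac{\mathrm{tr}(A)}{m} u^T u - u^T B u\right| &\le \left|\tfrac{1}{m}\|Z_2^T B^{1/2} u\|_2^2 - u^T B u\right| + \left|\tfrac{1}{m}\|A^{1/2} Z_1^T u\|_2^2 - \tfrac{\mathrm{tr}(A)}{m}\right| + 8C\varepsilon\tau_A \frac{\|B^{1/2}u\|_2}{\|A\|_2^{1/2}} \\
&\le 3\varepsilon\|B^{1/2}u\|_2^2 + 3\varepsilon\tau_A + 8C\varepsilon\tau_A \frac{\|B^{1/2}u\|_2}{\|A\|_2^{1/2}} \\
&\le C_2\varepsilon\|B^{1/2}u\|_2^2 + C_2\varepsilon\tau_A,
\end{aligned}$$

where $2\tau_A\|B^{1/2}u\|_2/\|A\|_2^{1/2} \le \tau_A + \|B^{1/2}u\|_2^2$. The theorem thus holds.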


J.l Proof of Corollary 22 

Lower bound: For all u € and 


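$$\begin{aligned}
\tfrac{1}{m} u^T (XX^T) u - \tfrac{1}{m} u^T \mathrm{tr}(A) I u &\ge u^T B u(1 - 3\varepsilon) - 3\varepsilon\tau_A - 8C\varepsilon\tau_A \frac{\|B^{1/2}u\|_2}{\|A\|_2^{1/2}} \\
&\ge u^T B u(1 - 3\varepsilon - 4C\varepsilon) - 3\varepsilon\tau_A - 4C\varepsilon\tau_A \\
&\ge u^T B u(1 - C_2\varepsilon) - C_2\varepsilon\tau_A \ge u^T B u(1 - 2\delta)
\end{aligned}$$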

where we bound the term using the fact that 1 <ta < Amax(f?) and 

C2TAe < 6Xm\n{B) and 6 < (iAmin(-B)/(C' 2 TA) 

C2E < 6 and C3E < (5 min (1 

V TA 


By a similar argument, we can prove the upper bound on the isometry property as stated in the corollary. □

J.2 Proof of Corollary 23


Recall the following 

A := X^X - tT{B)Im = + B^I'^Z^) - tr(S)/^ 

= {zJb^/^ + A^/^Zf) {ZiA^/^ + B^/^Z2) - tT{B)Im 

= {zJB^/‘^ZiA^/^ + A^I'^zIbAI'^Z^) + A^I'^Z'lZiA^I"^ + {zJBZ 2 - tr(B)/^). 

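Hence, for all vectors $u \in S^{m-1} \cap E$,

$$\begin{aligned}
\left|\tfrac{1}{f} u^T (X^T X) u - \tfrac{1}{f} u^T \mathrm{tr}(B) I u - u^T A u\right| &\le \tfrac{1}{f}\left|u^T Z_2^T B Z_2 u - \mathrm{tr}(B) u^T u\right| \\
&\quad + \left|\tfrac{1}{f} u^T A^{1/2} Z_1^T Z_1 A^{1/2} u - u^T A u\right| + \tfrac{2}{f}\left|u^T A^{1/2} Z_1^T B^{1/2} Z_2 u\right|.
\end{aligned}$$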


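By Lemma 35, we have on event $\mathcal{B}_1$,

$$\left|u^T Z_2^T B Z_2 u - \mathrm{tr}(B) u^T u\right| \le 4C\varepsilon\,\mathrm{tr}(B);$$

By Lemma 36, we have on event $\mathcal{B}_2$,

$$\left|u^T A^{1/2} Z_1^T B^{1/2} Z_2 u\right| \le 4C\varepsilon\,\mathrm{tr}(B)\,\frac{\|A^{1/2}u\|_2}{\|B\|_2^{1/2}}.$$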

For all u G n E, 


SCetb 


aA\ J\\B\\l/^ <2{2CeA^-^){2eA^ A^^ J 


LB 


< 4C^e^-+4e 

o 


aA\ 


< AC^stb + 4e 


And finally, we have also shown that for all u G -B on event Bg, 

1 


( 1 - 6 ) 


aA\ 


2 ^ V7 




u 


<(1 + ^) 


aA\ 


^ 1/2 


u 


Thus we have for all u G S”^ ^ n B", on event Hi n 132 n Bg, 


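$$\begin{aligned}
\left|\tfrac{1}{f} u^T (X^T X) u - \tfrac{1}{f} u^T \mathrm{tr}(B) I u - u^T A u\right| &\le \tfrac{1}{f}\left|u^T Z_2^T B Z_2 u - \mathrm{tr}(B) u^T u\right| + \left|\tfrac{1}{f} u^T A^{1/2} Z_1^T Z_1 A^{1/2} u - u^T A u\right| \\
&\quad + \tfrac{2}{f}\left|u^T A^{1/2} Z_1^T B^{1/2} Z_2 u\right| \\
&\le 4C\varepsilon\tau_B + 6\varepsilon\|A^{1/2}u\|_2^2 + 8C\varepsilon\tau_B \frac{\|A^{1/2}u\|_2}{\|B\|_2^{1/2}} \\
&\le 4C\varepsilon\tau_B + 6\varepsilon\|A^{1/2}u\|_2^2 + 4C^2\varepsilon\tau_B + 4\varepsilon\|A^{1/2}u\|_2^2 \\
&\le 10\varepsilon\|A^{1/2}u\|_2^2 + 4(C^2 + C)\varepsilon\tau_B.
\end{aligned} \tag{109}$$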
Upper bound: Thus we have by (109) for the maximum sparse eigenvalue of A at order k'.


Pmax(^) 


< 


max 

u^Au 

< max 

u^Au — u^Au 

ueEns^-^ 


liSEnS"*-! 



T Pmax(^) -^) 


Pmax(^) ^)(1 + lOe) + C^STb 


where C4 = 4(C + C^). The upper bound on pmax{k, A — A) m the theorem statement thus holds. 


Lower bound: Suppose 6*4 = 4(C + C^) V 10 


e < 7^ min -, ^ = TT ^ ^ 

L4 V / L5 \ tb 


We have by (109) for all u G S"^ n on event n ^2 n Bg, 

1 T(vTv\ 

—U (A X)u — U - j - U 


> u^Au — Au + 4CerB + SCetb A^^'^u 


\B 


11/2 


> v^Au — deu^Au — ^Cetb — SCer 


1/2 


B 


yll/2 


u 


> u^Au — lOeu^Au — 4(0 + C‘^)eTB > Au{l — lOe — 5) 

> Au{l — 26) 


where 4(C + C‘^)eTB < 6prmn{k, A) and lOe <5. □ 


K Proofs of Lemmas 35 and 36 and Corollary 37 

Throughout the following proofs, we denote by $r(B) = \mathrm{tr}(B)/\|B\|_2$. Let $\varepsilon \le 1/C$, where $C$ is large enough so that $cc'C^2 \ge 4$; hence the choice of $C = C_0/\sqrt{c'}$ satisfies our need.


Proof of Lemma 35. First we prove concentration bounds for all pairs of u, u G 11', where If' C ^ 
is an e-net of E. Let t = CK‘^£tv{B). We have by Lemma 27, and the union bound. 


uGlI', \u^BZv — BZv\ > t) 

( 

—cmm 


< 2 In'I exp 


< 2 I n' I exp 


t 




< 2exp (—C2e^r(n)/iT'^) 


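where we use the fact that $\|B\|_F^2 \le \|B\|_2\,\mathrm{tr}(B)$, and

$$|\Pi'| \le \binom{m}{k}(3/\varepsilon)^k \le \exp\left(k\log(3em/(k\varepsilon))\right),$$

while

$$c\min\left(C^2, \frac{CK^2}{\varepsilon}\right)\varepsilon^2\frac{\mathrm{tr}(B)}{\|B\|_2 K^4} = cC^2\varepsilon^2\frac{\mathrm{tr}(B)}{\|B\|_2 K^4} \ge cc'C^2 k\log\left(\frac{3em}{k\varepsilon}\right) \ge 4k\log\left(\frac{3em}{k\varepsilon}\right).$$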

Denote by B 2 the event such that for A := {Z^BZ - I), 


sup In'^Aul < Ce =: r'^i. 

u^v£li' 

holds. A standard approximation argument shows that under B 2 and for e < 1/2, 


sup |y^Ax| < ,, < ACe. 


\2 - 


x,y&”^-^nE ' ' (1 

The lemma is thus proved. □ 
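The cardinality bound $|\Pi'| \le \binom{m}{k}(3/\varepsilon)^k \le \exp(k\log(3em/(k\varepsilon)))$ used in this proof rests on the elementary estimate $\binom{m}{k} \le (em/k)^k$. The following small Python sketch compares the two sides on the log scale for a few illustrative parameter choices; the function names and the triples $(m, k, \varepsilon)$ are assumptions made here for illustration, not values from the paper.

```python
import math

def log_sparse_net_size(m, k, eps):
    """log of C(m, k) * (3/eps)^k, the size of a union of eps-nets over k-subsets."""
    return math.log(math.comb(m, k)) + k * math.log(3.0 / eps)

def log_sparse_net_bound(m, k, eps):
    """log of the upper bound exp(k * log(3 e m / (k eps)))."""
    return k * math.log(3.0 * math.e * m / (k * eps))

for m, k, eps in [(100, 5, 0.5), (1000, 10, 0.1), (50, 50, 0.25)]:
    exact, bound = log_sparse_net_size(m, k, eps), log_sparse_net_bound(m, k, eps)
    print(f"m={m}, k={k}, eps={eps}: log-size {exact:.2f} <= log-bound {bound:.2f}")
```

Working on the log scale avoids overflow for large $m$ and $k$, and it is the form in which the bound enters the union-bound exponent.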

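Proof of Lemma 36. By Lemma 27, we have for $t = C\varepsilon\,\mathrm{tr}(B)/\|B\|_2^{1/2}$, for $C = C_0/\sqrt{c'}$,

$$\begin{aligned}
P\left(\left|w^T Z_1^T B^{1/2} Z_2 u\right| > t\right) &\le 2\exp\left(-c\min\left(\frac{C^2\varepsilon^2\,\mathrm{tr}(B)^2}{K^4\|B\|_2\,\mathrm{tr}(B)},\ \frac{C\varepsilon\,\mathrm{tr}(B)}{K^2\|B\|_2}\right)\right) \\
&\le 2\exp\left(-c\min\left(\frac{C^2\varepsilon^2 r(B)}{K^4},\ \frac{C\varepsilon\, r(B)}{K^2}\right)\right) \\
&\le 2\exp\left(-c\min\left(C^2, \frac{CK^2}{\varepsilon}\right)\varepsilon^2 r(B)/K^4\right).
\end{aligned}$$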


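Choose an $\varepsilon$-net $\Pi' \subset S^{m-1}$ such that

$$\Pi' = \bigcup_{|J| = k} \Pi'_J, \quad \text{where } \Pi'_J \subset E_J \cap S^{m-1} \tag{110}$$

is an $\varepsilon$-net for $E_J \cap S^{m-1}$ and

$$|\Pi'| \le \binom{m}{k}(3/\varepsilon)^k \le \exp\left(k\log(3em/(k\varepsilon))\right). \tag{111}$$

Similarly, choose an $\varepsilon$-net $\Pi$ of $F \cap S^{m-1}$ of size at most $\exp(k\log(3em/(k\varepsilon)))$. By the union bound and Lemma 27, and for $K^2 \ge 1$,

$$\begin{aligned}
P\left(\exists w \in \Pi, u \in \Pi' \ \text{s.t.}\ \left|w^T Z_1^T B^{1/2} Z_2 u\right| > C\varepsilon\,\mathrm{tr}(B)/\|B\|_2^{1/2}\right) &\le |\Pi'||\Pi|\, 2\exp\left(-c\min\left(CK^2/\varepsilon, C^2\right)\varepsilon^2 r(B)/K^4\right) \\
&\le \exp\left(2k\log(3em/(k\varepsilon))\right) 2\exp\left(-cC^2\varepsilon^2 r(B)/K^4\right) \\
&\le 2\exp\left(-c_2\varepsilon^2 r(B)/K^4\right),
\end{aligned}$$

where $C$ is large enough such that $cc'C^2 =: C' > 4$ and for $\varepsilon \le 1/C$,

$$c\min\left(CK^2/\varepsilon, C^2\right)\varepsilon^2\frac{\mathrm{tr}(B)}{\|B\|_2 K^4} \ge C'k\log(3em/(k\varepsilon)) \ge 4k\log(3em/(k\varepsilon)).$$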
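Denote by $\Upsilon := Z_1^T B^{1/2} Z_2$. A standard approximation argument shows that if

$$\sup_{w \in \Pi,\, u \in \Pi'} \left|w^T \Upsilon u\right| \le C\varepsilon\,\frac{\mathrm{tr}(B)}{\|B\|_2^{1/2}},$$

an event which we denote by $\mathcal{B}_2$, then for all $u \in E \cap S^{m-1}$ and $w \in F \cap S^{m-1}$,

$$\left|w^T Z_1^T B^{1/2} Z_2 u\right| \le \frac{C\varepsilon\,\mathrm{tr}(B)}{(1 - \varepsilon)^2\|B\|_2^{1/2}} \le \frac{4C\varepsilon\,\mathrm{tr}(B)}{\|B\|_2^{1/2}}. \tag{112}$$

The lemma thus holds for $c_2 \ge C'/2 \ge 2$. □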


Proof of Corollary 37. Clearly (107) implies that (104) holds for $B = I$. Clearly (106) holds following the analysis of Lemma 35 by setting $B = I$, while replacing event $\mathcal{B}_1$ with $\mathcal{B}_3$, which denotes an event such that

$$\sup_{u, v \in \Pi'} \left|v^T\left(\tfrac{1}{f} Z^T Z - I\right)u\right| \le C\varepsilon.$$

The rest of the proof follows by replacing $E$ with $F$ everywhere. The corollary thus holds. □